Improved Graph Clustering

Yudong  Chen,  Sujay  Sanghavi,  Member,  IEEE,  and  Huan  Xu 


Abstract 

Graph  clustering  involves  the  task  of  partitioning  nodes,  so  that  the  edge  density  is  higher  within  partitions 
as  opposed  to  across  partitions.  A  natural,  classic  and  popular  statistical  setting  for  evaluating  solutions  to  this 
problem  is  the  stochastic  block  model,  also  referred  to  as  the  planted  partition  model. 

In  this  paper  we  present  a  new  algorithm  -  a  convexified  version  of  Maximum  Likelihood  -  for  graph  clustering. 
We  show  that,  in  the  classic  stochastic  block  model  setting,  it  outperforms  all  existing  methods  by  polynomial 
factors;  in  fact,  it  is  within  logarithmic  factors  of  known  lower  bounds  for  spectral  methods. 

We  then  show  that  this  guarantee  carries  over  to  a  more  general  semi-random  extension  of  the  stochastic  block 
model;  our  method  can  handle  settings  of  heterogeneous  degree  distributions,  unequal  cluster  sizes,  outlier  nodes, 
planted  k-cliques  etc. 


I.  Introduction 

This  paper  proposes  a  new  algorithm  for  the  following  task:  given  an  undirected  unweighted  graph, 
assign the nodes into disjoint clusters so that the density of edges within clusters is higher than the
density of edges across clusters. Clustering arises in applications such as community detection, user profiling, link
prediction, collaborative filtering etc. In these applications, one is often given as input a set of similarity
relationships  (either  “1”  or  “0”)  and  the  goal  is  to  identify  groups  of  similar  objects.  For  example,  given 
the  friendship  relations  on  Facebook,  one  would  like  to  detect  tightly  connected  communities,  which  is 
useful  for  subsequent  tasks  like  customized  recommendation  and  advertisement. 

Graphs  in  modern  applications  have  several  characteristics  that  complicate  graph  clustering: 

• Small density gap: the edge density across clusters is only a small additive or multiplicative factor
different  from  within  clusters; 

•  Sparsity:  the  graph  is  overall  very  sparse  even  within  clusters; 

• Outliers: there may exist nodes that do not belong to any cluster and are loosely connected to the
rest  of  the  graph; 

•  High  dimensionality:  the  number  of  clusters  may  grow  unbounded  as  a  function  of  the  number  of 
nodes  n,  which  means  the  sizes  of  the  clusters  can  be  vanishingly  small  compared  to  n; 

•  Heterogeneity:  the  cluster  sizes,  node  degrees  and  edge  densities  may  be  non-uniform;  there  may 
even  exist  edges  that  are  not  well-modeled  by  probabilistic  distributions  as  well  as  hierarchical  cluster 
structures. 

Various large modern datasets and graphs have such characteristics [1, 2]; examples include the web
graph,  social  graphs  of  various  social  networks  etc.  As  has  been  well-recognized,  these  characteristics 
make  clustering  more  difficult.  When  the  difference  between  in-cluster  and  across-cluster  edge  densities 
is  small,  the  clustering  structure  is  less  significant  and  thus  harder  to  detect.  Sparsity  further  reduces  the 
amount  of  information  and  makes  the  problem  noisier.  In  the  high  dimensional  regime,  there  are  many 
small  clusters,  which  are  easy  to  lose  in  the  noise.  Heterogeneous  and  non-random  structures  in  the  graphs 
foil many algorithms that otherwise perform well. Finally, the existence of hierarchy and outliers renders
many existing algorithms and theoretical results inapplicable, as they fix the number of clusters a priori
and force each node to be assigned to a cluster. It is desirable to design an algorithm that can handle all
these issues in a principled manner.

The work of Y. Chen was supported by NSF grant EECS-1056028 and DTRA grant HDTRA1-08-0029. The work of S. Sanghavi was
supported by NSF grant 1017525, an Army Research Office (ARO) grant W911NF-11-1-0265, and a DTRA Young Investigator award. The
work of H. Xu was partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265-000-443-112 and
NUS startup grant R-265-000-384-133.

Y. Chen and S. Sanghavi are with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX
78712 USA. H. Xu is with the Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive 1, Singapore
117575, Singapore (e-mail: ydchen@utexas.edu, sanghavi@mail.utexas.edu, mpexuh@nus.edu.sg).

A.  Our  Contributions 

Our  algorithmic  contribution  is  a  new  method  for  unweighted  graph  clustering.  It  is  motivated  by  the 
maximum-likelihood  estimator  for  the  classical  Stochastic  Blockmodel  [3]  (a.k.a.  the  Planted  Partition 
Model  [4])  for  random  clustered  graphs.  In  particular,  we  show  that  this  maximum-likelihood  estimator 
can  be  written  as  a  linear  objective  over  combinatorial  constraints;  our  algorithm  is  a  convex  relaxation 
of  these  constraints,  yielding  a  convex  program  overall.  While  this  is  the  motivation,  it  performs  well  - 
both  in  theory  and  practice  -  in  settings  that  are  not  just  the  standard  stochastic  blockmodel. 

Our  main  analytical  result  in  this  paper  is  theoretical  guarantees  on  its  performance;  we  study  it  in  a 
semi-random  generalized  stochastic  blockmodel.  This  model  generalizes  not  only  the  standard  stochastic 
blockmodel  and  planted  partition  model,  but  many  other  classical  planted  models  including  the  planted 
k-clique model [5, 6], the planted coloring model [7, 8] and their semi-random variants [9, 10, 11]. Our
main result gives the condition (as a function of the in/cross-cluster edge densities p and q, density gap
$|p - q|$, cluster sizes K and numbers of inliers/outliers $n_1$ and $n_2$) under which our algorithm is guaranteed
to recover the ground-truth clustering. When p > q, the condition reads

$$p - q = \tilde{\Omega}\left( \frac{\sqrt{p(1-q)(n_1 + n_2)}}{K} \right);$$

here all the parameters are allowed to scale with $n = n_1 + n_2$, the total number of nodes. An analogous
result holds for p < q.

While the planted and stochastic block models have a rich literature, this single result shows that our
algorithm outperforms every existing method for the standard planted partition/k-clique/noisy-coloring
models, and matches them (up to at most logarithmic factors) in all other cases, in the sense that our
algorithm succeeds for a bigger range of the parameters. In fact, there is evidence indicating that we
are close to the boundary at which any polynomial-time algorithm can be expected to work. Moreover,
the proof for our main theorem is relatively simple, relying only on standard concentration results. Our
simulation study supports our theoretical findings: the proposed method is effective in clustering noisy
graphs and outperforms existing methods.

The rest of the paper is organized as follows: Section I-B provides an overview of related work;
Section II presents our algorithm; Section III describes the Semi-Random Generalized Stochastic Blockmodel
(which is a generalization of the standard stochastic blockmodel, one that allows the modeling of the issues
mentioned above); Section IV presents the main results - a performance analysis of our algorithm for the
generalized stochastic blockmodel and a detailed comparison to the existing literature on this problem;
Section V provides simulation results; finally, the proofs of our theoretical results are given in Sections VI
to IX.

B.  Related  Work 

The  general  field  of  clustering,  or  even  graph  clustering,  is  too  vast  for  a  detailed  survey  here;  we  focus 
on  the  most  related  threads,  and  therein  too  primarily  on  work  which  provides  analytical  guarantees  on 
the  resulting  algorithms. 

Stochastic  block  models:  Also  called  “planted  models”  [3,  4],  these  are  perhaps  the  most  natural 
random  clustered  graph  models.  In  the  simplest  or  standard  setting,  nodes  are  partitioned  into  disjoint 
equal-sized  subsets  (called  the  underlying  clusters),  and  then  edges  are  generated  independently  and  at 
random,  with  the  probability  p  of  an  edge  between  two  nodes  in  the  same  cluster  higher  than  the  probability 
q when the two nodes are in different clusters. The algorithmic clustering task in this setting is to recover
the underlying clusters given the graph. The parameters p, q and the size K of the smallest cluster typically
govern whether an algorithm can do this clustering, or not.

TABLE I
Comparison with literature for the standard Stochastic Blockmodel

Paper                           Cluster size K
Boppana (1987) [15]             n/2
Jerrum et al. (1998) [16]       n/2
Condon et al. (2001) [3]        Θ(n)
Carson et al. (2001) [17]       n/2
Feige et al. (2001) [10]        n/2
McSherry (2001) [5]             Ω(n^{2/3})
Bollobas (2004) [9]             Θ(n)
Giesen et al. (2005) [18]       Ω(√n)
Shamir (2007) [19]              Ω(√n log n)
Coja-Oghlan (2010) [20]         Ω(n^{4/5})
Rohe et al. (2011) [21]         Ω((n log n)^2…) [exponent partly illegible]
Oymak et al. (2011) [22]        Ω(√n)
Chaudhuri et al. (2012) [12]    Ω(√(n log n))
Ames (2012) [23]                Ω(√n)
Our result                      Ω(√n)

[The table has two further columns, "Density gap p − q" and "Sparsity p"; their entries are garbled beyond recovery and are not reproduced here.]

To facilitate direct comparison, this table specializes some of the results to the case where every underlying partition (i.e. cluster) is of
the same size K, and the in/cross-cluster edge probabilities are uniformly p and q. Some of the algorithms above need this assumption, and
some - like ours - do not.

a The soft Ω̃(·) notation ignores log factors.

There  is  now  a  long  line  of  analytical  work  on  stochastic  block  models;  we  focus  on  methods  that 
allow  for  exact  recovery  (i.e.  every  node  is  correctly  classified),  and  summarize  the  conditions  required 
by  known  methods  in  Table  I.  As  can  be  seen,  we  improve  over  all  existing  methods  by  polynomial  factors. 
In  addition,  and  as  opposed  to  several  of  these  methods,  we  can  handle  outliers,  heterogeneity,  hierarchy 
in clustering etc. A complementary line of work has investigated lower bounds in this setting; i.e., for
what values/scalings of p, q and K is it not possible (either for any algorithm, or for any polynomial time
algorithm) to recover the underlying clusters [12, 13, 14]. We discuss these two lines of work in more
detail in the main results section.

Convex  methods  for  matrix  decomposition:  Our  method  is  related  to  recent  literature  on  the  recovery 
of  low-rank  matrices  using  convex  optimization,  and  in  particular  the  recovery  of  such  matrices  from 
“sparse”  perturbations  (i.e.  where  a  fraction  of  the  elements  of  the  low-rank  matrix  are  possibly  arbitrarily 
modified, while others are untouched). Sparse and low-rank matrix decomposition using convex optimization
was initiated by [24, 25]; follow-up works [26, 27] have the current state-of-the-art guarantees on
this  problem,  and  [28]  applies  it  directly  to  graph  clustering. 

The method in this paper is Maximum Likelihood, but it can also be viewed as a weighted version
of sparse and low-rank matrix decomposition, with different elements of the sparse part penalized
differently, based on the given input graph. There is currently no literature or analysis of weighted matrix
decomposition; in that sense, while our weights have a natural motivation in our setting, our results are
likely to have broader implications, for example robust versions of PCA when not all errors are created
equal, but have a corresponding prior.
II. Algorithm

We  now  describe  our  algorithm;  as  mentioned,  it  is  a  convex  relaxation  of  Maximum  Likelihood  (ML) 
as  applied  to  the  standard  stochastic  blockmodel.  So,  in  what  follows,  we  first  develop  notation  and  the 
exact  ML  estimator,  and  then  its  relaxation. 

ML  for  the  standard  stochastic  blockmodel:  Recall  that  in  the  standard  stochastic  blockmodel  nodes 
are  partitioned  into  disjoint  clusters,  and  edges  in  the  graph  are  chosen  independently;  the  probability  of 
an  edge  between  a  pair  of  nodes  in  the  same  cluster  is  p,  and  for  a  pair  of  nodes  in  different  clusters  it 
is q. Given the graph, the task is to find the underlying clusters that generated it. To write down the ML
estimator for this, let us represent any candidate partition by a corresponding cluster matrix $Y \in \mathbb{R}^{n \times n}$
where $y_{ij} = 1$ if and only if nodes i and j are in the same cluster, and 0 otherwise¹. Any Y thus needs
to have a block-diagonal structure, with each block being all 1's.
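As a small illustration of this representation, the following sketch builds a cluster matrix from a list of cluster labels. The function name `cluster_matrix` and the use of the label -1 for a node that belongs to no cluster are assumptions made for the example, not notation from the paper.

```python
import numpy as np

def cluster_matrix(labels):
    """Return Y with y_ij = 1 iff nodes i and j are in the same cluster.

    labels[i] is the cluster id of node i; the value -1 (a convention
    assumed for this sketch) marks a node that belongs to no cluster.
    """
    labels = np.asarray(labels)
    in_cluster = labels >= 0
    same = labels[:, None] == labels[None, :]
    return (same & in_cluster[:, None] & in_cluster[None, :]).astype(int)

# Nodes ordered by cluster: the all-ones diagonal blocks are visible directly.
Y = cluster_matrix([0, 0, 0, 1, 1])
print(Y)
```

When the nodes are not ordered by cluster, the same matrix is block-diagonal only after a suitable permutation of rows and columns.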

A  vanilla  ML  estimator  then  involves  optimizing  a  likelihood  subject  to  the  combinatorial  constraint 
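that the search space is the cluster matrices. Let $A \in \mathbb{R}^{n \times n}$ be the observed adjacency matrix of the
graph²; then, the log likelihood function of A given Y is

$$\log P(A \mid Y) = \log \Big[ \prod_{(i,j):\, y_{ij}=1} p^{a_{ij}} (1-p)^{1-a_{ij}} \prod_{(i,j):\, y_{ij}=0} q^{a_{ij}} (1-q)^{1-a_{ij}} \Big].$$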

We  notice  that  this  can  be  written,  via  a  re-arrangement  of  terms,  as 

$$\log P(A \mid Y) = \log\Big(\frac{p}{q}\Big) \sum_{a_{ij}=1} y_{ij} \;-\; \log\Big(\frac{1-q}{1-p}\Big) \sum_{a_{ij}=0} y_{ij} \;+\; C; \qquad (1)$$


here C collects the terms that are independent of Y. The ML estimator maximizes the above
expression subject to Y being a cluster matrix. While the objective is a linear function of Y, this
optimization  problem  is  combinatorial  due  to  the  requirement  that  Y  be  a  cluster  matrix  (i.e.,  block- 
diagonal  with  each  block  being  all-ones),  and  is  intractable  in  general. 
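The re-arrangement leading to (1) can be checked numerically. The sketch below samples a small graph, evaluates the log-likelihood directly from the product form, and compares it with the linear form plus the constant C. The sizes, the values of p and q, and the restriction to pairs i < j are choices made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 8, 0.7, 0.2
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
Y = (labels[:, None] == labels[None, :]).astype(int)

# Work with the pairs i < j; y and a are the entries of Y and A on them.
iu = np.triu_indices(n, k=1)
y = Y[iu]
a = (rng.random(y.size) < np.where(y == 1, p, q)).astype(int)

# Direct log-likelihood: in-cluster pairs use p, cross-cluster pairs use q.
direct = np.sum(np.where(y == 1,
                         a * np.log(p) + (1 - a) * np.log(1 - p),
                         a * np.log(q) + (1 - a) * np.log(1 - q)))

# Re-arranged form (1): linear in y plus a constant C that ignores Y.
C = np.sum(a * np.log(q) + (1 - a) * np.log(1 - q))
linear = (np.log(p / q) * np.sum(y[a == 1])
          - np.log((1 - q) / (1 - p)) * np.sum(y[a == 0]) + C)

print(direct, linear)
```

The two printed values agree up to floating-point error, since each pair with y_ij = 1 contributes log p or log(1 - p) in both forms and each pair with y_ij = 0 contributes only through C.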

Our algorithm: We obtain a convex and tractable algorithm by replacing the constraint “Y is a cluster
matrix” with (i) constraints $0 \le y_{ij} \le 1$ for all elements i, j, and (ii) a nuclear norm³ regularizer $\|Y\|_*$ in
the objective. The latter encourages Y to be low-rank, and is based on the well-established insight that a
cluster matrix has low rank - in particular, its rank equals the number of clusters. (We discuss other
related relaxations after we present our algorithm.)

Also notice that the likelihood expression (1) is linear in Y and only the ratio of the two coefficients
$\log(p/q)$ and $\log((1-q)/(1-p))$ is important. We thus introduce a parameter t which allows us to
choose any ratio. This has the advantage that instead of knowing both p and q, we only need to choose
one number t (which should be between p and q; we remark on how to choose t later). This leads to the
following convex formulation:


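$$\max_{Y \in \mathbb{R}^{n \times n}} \quad c_A \sum_{a_{ij}=1} y_{ij} \;-\; c_{A^c} \sum_{a_{ij}=0} y_{ij} \;-\; 48\sqrt{n}\, \|Y\|_* \qquad (2)$$

$$\text{s.t.} \quad 0 \le y_{ij} \le 1, \ \forall i, j, \qquad (3)$$

where the weights $c_A$ and $c_{A^c}$ are given by

$c_A$ = [expression missing in source] and $c_{A^c}$ = [expression missing in source]. (4)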

¹ We adopt the convention that $y_{ii} = 1$ for any node i that belongs to a cluster.
² We assume $a_{ii} = 1$ for all i.
³ The nuclear norm of a matrix is the sum of its singular values.

Here the factor $48\sqrt{n}$ balances the contributions of the nuclear norm and the likelihood, and the specific
forms of $c_A$ and $c_{A^c}$ are derived from our analysis. The optimization problem (2)-(3) is convex and can
be cast as a Semidefinite Program (SDP) [24, 29]. More importantly, it can be solved using efficient
first-order methods for large graphs (see Section V-A).
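To make the objective concrete, the following sketch evaluates it at a candidate Y and checks the box constraint (3). It assumes the objective has the form "weighted sum over observed edges, minus weighted sum over non-edges, minus 48√n times the nuclear norm". The weights are passed in as plain numbers `c_a` and `c_ac` rather than computed from t, and the function names are hypothetical. This only evaluates the objective; it is not a solver.

```python
import numpy as np

def objective(Y, A, c_a, c_ac):
    """Value of the objective in (2) at a candidate Y.

    c_a and c_ac stand for the weights c_A and c_{A^c}; they are supplied
    by the caller here (an assumption of this sketch).
    """
    n = A.shape[0]
    reward = c_a * Y[A == 1].sum()          # entries where an edge is observed
    penalty = c_ac * Y[A == 0].sum()        # entries where no edge is observed
    nuclear = np.linalg.norm(Y, ord="nuc")  # sum of singular values
    return reward - penalty - 48.0 * np.sqrt(n) * nuclear

def is_feasible(Y, tol=1e-9):
    """Constraint (3): every entry of Y lies in [0, 1]."""
    return bool(np.all(Y >= -tol) and np.all(Y <= 1 + tol))

# Two clusters of size 2, with the graph equal to the cluster matrix itself.
Y = np.kron(np.eye(2), np.ones((2, 2)))
value = objective(Y, Y.copy(), c_a=1.0, c_ac=1.0)
print(value, is_feasible(Y))
```

In this toy case the nuclear norm of Y is 4 (two singular values equal to 2), so the regularizer contributes 48 · 2 · 4 = 384 and the eight matching entries contribute 8.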

Our  algorithm  is  given  as  Algorithm  1.  Depending  on  the  given  A  and  the  choice  of  t,  the  optimal 
solution  Y  may  or  may  not  be  a  cluster  matrix.  Checking  if  a  given  Y  is  a  cluster  matrix  can  be  done 
easily,  e.g.,  via  an  SVD,  which  will  also  reveal  cluster  memberships  if  it  is  a  cluster  matrix.  If  it  is  not, 
any  one  of  several  rounding/aggregation  ideas  (e.g.,  the  one  in  [30])  can  be  used  empirically;  we  do  not 
delve into this approach in this paper, and simply output failure. In Section IV we provide sufficient
conditions under which Y is a cluster matrix, with no rounding required.


Algorithm 1 Convex Clustering
Input: $A \in \mathbb{R}^{n \times n}$, $t \in (0, 1)$
Solve program (2)-(3) with weights (4). Let Y be an optimal solution.
if Y is a cluster matrix then
    Output cluster memberships obtained from Y.
else
    Output “Failure”.
end if
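The last step of Algorithm 1 can be sketched as follows. The text suggests an SVD for this check; the sketch below instead uses a direct algebraic test, namely that a symmetric 0/1 matrix Z consists of all-ones diagonal blocks (plus all-zero rows for unclustered nodes) exactly when (ZZ)_ij = s_i z_ij, where s_i is the i-th row sum. The function name, the tolerance, and the label -1 for unclustered nodes are assumptions of the example.

```python
import numpy as np

def memberships_if_cluster_matrix(Y, tol=1e-6):
    """Return a label per node if Y is a cluster matrix, else None.

    Label -1 marks a node whose row is all zeros (in no cluster).
    """
    Z = np.rint(Y)
    if not np.allclose(Y, Z, atol=tol):           # entries must be integers
        return None
    if not np.all((Z == 0) | (Z == 1)):           # and equal to 0 or 1
        return None
    if not np.array_equal(Z, Z.T):                # must be symmetric
        return None
    sizes = Z.sum(axis=1)
    # For all-ones diagonal blocks, (Z @ Z)[i, j] = sizes[i] * Z[i, j].
    if not np.array_equal(Z @ Z, sizes[:, None] * Z):
        return None
    labels = -np.ones(Z.shape[0], dtype=int)
    next_label = 0
    for i in range(Z.shape[0]):
        if sizes[i] > 0 and labels[i] < 0:
            labels[Z[i] == 1] = next_label        # row i lists its whole cluster
            next_label += 1
    return labels
```

Returning `None` plays the role of the “Failure” output; any rounding scheme would replace that branch.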


A.  Remarks  about  the  Algorithm 

Note  that  while  we  derive  our  algorithm  from  the  standard  stochastic  blockmodel,  our  analytical  results 
hold in a much more general setting. In practice, one could execute the algorithm (with appropriate choice
of t, and hence $c_A$ and $c_{A^c}$) on any given graph.

Tighter relaxations: The formulation (2)-(3) is not the only way to relax the non-convex ML estimator.
Instead of the nuclear norm regularizer, a hard constraint $\|Y\|_* \le n$ may be used. One may further replace
this constraint with the positive semidefinite constraint $Y \succeq 0$ and the linear constraints $y_{ii} = 1$, both
satisfied by any cluster matrix⁴. It is not hard to check that these modifications lead to convex relaxations
with smaller feasible sets, so any performance guarantee for our formulation (2)-(3) also applies to these
alternative formulations. We choose not to use these potentially tighter relaxations based on the following
theoretical and practical considerations: a) These formulations do not work well when the numbers of
outliers and clusters are unknown. b) We do not obtain better theoretical guarantees with them. In fact,
the work [30] considers these tighter constraints, but our exact recovery guarantees improve on theirs.
Moreover, as we argue in the next section, our guarantees are likely to be order-wise optimal and thus
any alternative convex formulations are unlikely to provide significant improvements in a scaling sense. c)
Our simpler formulation facilitates efficient solution for large graphs via first-order methods; we describe
one such method in Section V-A.
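As a quick sanity check of the claim that a cluster matrix satisfies these tighter constraints when there are no outliers, the sketch below builds one and verifies positive semidefiniteness, the unit diagonal, and that the nuclear norm equals n. The cluster sizes are hypothetical values chosen for the example.

```python
import numpy as np

sizes = [3, 2, 4]                       # hypothetical cluster sizes, no outliers
n = sum(sizes)
labels = np.repeat(np.arange(len(sizes)), sizes)
Y = (labels[:, None] == labels[None, :]).astype(float)

eigenvalues = np.linalg.eigvalsh(Y)     # Y is symmetric
nuclear_norm = np.linalg.norm(Y, ord="nuc")

is_psd = bool(eigenvalues.min() > -1e-9)           # Y is positive semidefinite
unit_diagonal = bool(np.all(np.diag(Y) == 1.0))    # y_ii = 1 for every node
norm_equals_n = bool(np.isclose(nuclear_norm, n))  # hence ||Y||_* <= n holds
print(is_psd, unit_diagonal, norm_equals_n)
```

For a positive semidefinite matrix the nuclear norm equals the trace, which is why the unit diagonal forces the nuclear norm to be exactly n here.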

Choice of t: Our algorithm requires an extraneous input t. For the standard planted r-clique problem [6,
5] (with r cliques planted in a random graph $G_{n,1/2}$), one can use t = 3/4 (see Section IV-C2). For the
standard  stochastic  blockmodel  (with  nodes  partitioned  into  equal-size  clusters  and  edge  probabilities 
being  uniformly  p  and  q  inside  and  across  clusters),  the  value  of  t  can  be  easily  computed  from  data  (see 
Section  IV-D).  In  these  cases,  our  algorithm  has  no  tuning  parameters  whatsoever  and  does  not  require 
knowledge  of  the  number  or  sizes  of  the  clusters.  For  the  general  setting,  t  should  be  chosen  to  lie  between 
p  and  q,  which  now  represent  the  lower/upper  bounds  for  the  in/cross-cluster  edge  densities.  As  such, 
t  can  be  interpreted  as  the  resolution  of  the  clustering  algorithm.  To  see  this,  suppose  the  clusters  have 
a  hierarchical  structure,  where  each  big  cluster  is  partitioned  into  smaller  sub-clusters  with  higher  edge 
densities inside. In this case, it is not clear a priori which level of clusters, the larger ones or the
smaller ones, should be recovered. This ambiguity is resolved by specifying t: our algorithm recovers
those clusters with in-cluster edge density higher than t and cross-cluster density lower than t. With a
larger t, the algorithm operates at a higher resolution and detects small clusters with high density. By
varying t, our algorithm can be turned into a method for multi-resolution clustering [1] which explores
all levels of the cluster hierarchy. We leave this to future work. Importantly, the above example shows
that it is generally impossible to uniquely determine the value of t from data.

⁴ The constraints $y_{ii} = 1$, $\forall i$ are satisfied when there is no outlier.

III.  The  Generalized  Stochastic  Blockmodel 

While  our  algorithm  above  is  derived  as  a  relaxation  of  ML  for  the  standard  stochastic  blockmodel, 
we  establish  performance  guarantees  in  a  much  more  general  setting,  which  is  defined  by  six  parameters 
$n_1$, $n_2$, r, K, p and q; it is described below.

Definition 1 (Generalized Stochastic Blockmodel (GSBM)). If p > q (p < q, resp.), consider a random
graph generated as follows: The $n = n_1 + n_2$ nodes are divided into two sets $V_1$ and $V_2$. The $n_1$ nodes
in $V_1$ are further partitioned into r disjoint sets, which we will refer to as the “true” clusters. Let K be
the minimum size of a true cluster. For every pair of nodes i, j that belong to the same true cluster, edge
(i, j) is present in the graph with probability that is at least (at most, resp.) p, while for every pair where
the nodes are in different clusters the edge is present with probability at most (at least, resp.) q. The other
$n_2$ nodes in $V_2$ are not in any cluster (we will call them outliers); for each $i \in V_2$ and $j \in V_1 \cup V_2$, there
is an edge between the pair i, j with probability at most (at least, resp.) q.

Definition 2 (Semi-random GSBM). On a graph generated from GSBM with p > q (p < q, resp.), an
adversary is allowed to arbitrarily (a) add (remove, resp.) edges between nodes in the same true cluster,
and (b) remove (add, resp.) edges between pairs of nodes if they are in different clusters, or if at least
one of them is an outlier in $V_2$.

The objective is to find the underlying true clusters, given the graph generated from the semi-random
GSBM.
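The random part of the model can be sketched as a sampler. The code below covers only the special case of Definition 1 in which the edge probabilities are exactly p within clusters and exactly q for all other pairs (the definition allows them to be bounds), with p > q. The function name and argument layout are hypothetical, and the diagonal of A is set to 1 to follow the convention stated in the algorithm section.

```python
import numpy as np

def sample_gsbm(sizes, n_outliers, p, q, rng):
    """Sample an adjacency matrix A and the true cluster matrix Y.

    sizes: list of true cluster sizes; n_outliers: number of nodes in V_2.
    Edge probability is exactly p inside a cluster and q otherwise
    (a special case of Definition 1, assumed for this sketch).
    """
    labels = np.concatenate([np.repeat(np.arange(len(sizes)), sizes),
                             -np.ones(n_outliers, dtype=int)])
    n = labels.size
    clustered = labels >= 0
    Y = ((labels[:, None] == labels[None, :])
         & clustered[:, None] & clustered[None, :]).astype(int)
    prob = np.where(Y == 1, p, q)
    upper = np.triu(rng.random((n, n)) < prob, k=1).astype(int)
    A = upper + upper.T                 # undirected graph
    np.fill_diagonal(A, 1)              # convention a_ii = 1
    return A, Y

rng = np.random.default_rng(0)
A, Y = sample_gsbm([4, 3], n_outliers=2, p=0.8, q=0.1, rng=rng)
print(A.shape, int(Y.sum()))
```

Outlier nodes have all-zero rows in Y, including the diagonal entry, while their diagonal entry in A is still 1.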

The standard stochastic blockmodel/planted partition model is a special case of GSBM with $n_2 = 0$, $r \ge 2$,
all cluster sizes equal to K, and all in-cluster (cross-cluster, resp.) probabilities equal to p (q, resp.).
GSBM generalizes the standard models as it allows for heterogeneity in the graph:

• p and q are lower and upper bounds, instead of exact edge probabilities;

•  K  is  also  a  lower  bound,  so  clusters  can  have  different  sizes; 

•  outliers  (nodes  not  in  any  cluster)  are  allowed. 

GSBM  removes  many  restrictions  in  standard  planted  models  and  thus  better  models  practical  graphs. 

The  semi-random  GSBM  allows  for  further  modeling  power.  It  blends  the  worst  case  models,  which 
are  often  overly  pessimistic,5  and  the  purely  random  graphs,  which  are  extremely  unstructured  and  have 
very  special  properties  usually  not  possessed  by  real-world  graphs  [31].  This  semi-random  framework 
has  been  used  and  studied  extensively  in  the  computer  science  literature  as  a  better  model  for  real-world 
networks  [9,  10,  11].  At  first  glance,  the  adversary  seems  to  make  the  problem  easier  by  adding  in-cluster 
edges  and  removing  cross-cluster  edges  when  p  >  q.  This  is  not  necessarily  the  case.  The  adversary  can 
significantly  change  some  statistical  properties  of  the  random  graph  (e.g.,  alter  spectral  structure  and  node 
degrees,  and  create  local  optima  by  adding  dense  spots  [10]),  and  foil  algorithms  that  over-exploit  such 
properties.  In  fact,  some  spectral  algorithms  that  work  extremely  well  on  random  models  are  shown  to 
fail  in  the  semi-random  setting  [8].  An  algorithm  that  is  robust  against  an  adversary  is  more  likely  to 
work well on real-world graphs. As is shown later, our algorithm possesses this desired property.

A.  Special  Cases 

GSBM  recovers  as  special  cases  many  classical  and  widely  studied  models  for  clustered  graphs,  by 
considering different values for the parameters $n_1$, $n_2$, r, K, p and q. We classify these models into two
categories  based  on  the  relation  between  p  and  q. 

⁵ For example, the minimum graph bisection problem is NP-hard.

1) p > q: GSBM with p > q models homophily, the tendency of individuals belonging to the same
community to connect more than those in different communities. Special cases include:

• Planted Clique [32]: $p = 1$, $r = 1$ (so $n_1 = K$) and $n_2 > 0$;

• Planted r-Clique [5]: $p = 1$ and $r > 1$;

• Stochastic Blockmodel/Planted Partition [4, 3]: $n_2 = 0$, $r \ge 2$ with all cluster sizes equal to K.

2) p < q: GSBM with p < q models heterophily. Special cases include:

• Planted Coloring [10]: $q > p = 0$, $r \ge 2$, and $n_2 = 0$;

• Planted r-Cut/noisy coloring [9, 13]: $q > p > 0$, $r \ge 2$, and $n_2 = 0$.

In the next two sections, we describe our algorithm and provide performance guarantees under the
semi-random GSBM. This implies guarantees for all the special cases above. We provide a detailed comparison
with  literature  after  presenting  our  results. 

IV.  Main  Results:  Performance  Guarantees 

In  this  section  we  provide  analytical  performance  guarantees  for  our  algorithm  under  the  semi-random 
GSBM.  We  provide  a  unified  theorem,  and  then  discuss  its  consequences  for  various  special  cases,  and 
compare  with  literature.  We  also  discuss  how  to  estimate  the  parameter  t  in  the  special  case  of  the  standard 
stochastic  blockmodel.  We  shall  first  consider  the  case  with  p  >  q.  The  p  <  q  case  is  a  direct  consequence 
and  is  discussed  in  Section  IV-C3.  All  proofs  are  postponed  to  Sections  VI  to  IX. 


A.  A  Monotone  Lemma 

Our  optimization-based  algorithm  has  a  nice  monotone  property:  adding/removing  edges  “aligned  with” 
the  optimal  Y  (as  is  done  by  the  adversary)  cannot  result  in  a  different  optimum.  This  is  summarized  in 
the  following  lemma. 

Lemma 1. Suppose p > q and Y is the unique optimum of (2)-(3) for a given A. Suppose we now arbitrarily
change some edges of A to obtain $\tilde{A}$, by (a) choosing some edges such that $y_{ij} = 1$ but $a_{ij} = 0$, and
making $\tilde{a}_{ij} = 1$, and (b) choosing some edges where $y_{ij} = 0$ but $a_{ij} = 1$, and making $\tilde{a}_{ij} = 0$. Then Y
is also the unique optimum of (2)-(3) with $\tilde{A}$ as the input.

The  lemma  shows  that  our  algorithm  is  inherently  robust  under  the  semi-random  model.  In  particular, 
the  algorithm  succeeds  in  recovering  the  true  clusters  on  the  semi-random  GSBM  as  long  as  it  succeeds 
on  GSBM  with  the  same  parameters.  In  the  sequel,  we  therefore  focus  solely  on  GSBM,  with  the 
understanding  that  any  guarantee  for  it  immediately  implies  a  guarantee  for  the  semi-random  variant. 
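One way to see why such a property is plausible (a numerical illustration, not the proof given later in the paper) is to look at the part of the objective that depends on A. Changing A in the directions allowed by Lemma 1 raises the value at a 0/1 matrix Y by at least as much as at any other feasible point, so the gap between them cannot shrink. The weights, sizes, and function names below are hypothetical, and symmetry of A is ignored for brevity.

```python
import numpy as np

def linear_part(Y, A, c_a, c_ac):
    """Part of objective (2) that depends on A (the nuclear norm does not)."""
    return c_a * Y[A == 1].sum() - c_ac * Y[A == 0].sum()

def aligned_change(A, Y, rng, rate=0.5):
    """Flip entries of A only in the directions allowed in Lemma 1."""
    pick = rng.random(A.shape) < rate
    A_new = A.copy()
    A_new[pick & (Y == 1) & (A == 0)] = 1   # (a) add edges where y_ij = 1
    A_new[pick & (Y == 0) & (A == 1)] = 0   # (b) remove edges where y_ij = 0
    return A_new

rng = np.random.default_rng(0)
n = 6
labels = np.array([0, 0, 0, 1, 1, 1])
Y = (labels[:, None] == labels[None, :]).astype(float)
A = (rng.random((n, n)) < 0.5).astype(int)
A_new = aligned_change(A, Y, rng)

c_a, c_ac = 1.5, 0.8                        # hypothetical positive weights
gaps_before, gaps_after = [], []
for _ in range(100):
    Y_other = rng.random((n, n))            # a feasible point, 0 <= y <= 1
    gaps_before.append(linear_part(Y, A, c_a, c_ac)
                       - linear_part(Y_other, A, c_a, c_ac))
    gaps_after.append(linear_part(Y, A_new, c_a, c_ac)
                      - linear_part(Y_other, A_new, c_a, c_ac))
never_shrinks = all(b <= a + 1e-9 for b, a in zip(gaps_before, gaps_after))
print(never_shrinks)
```

Each flipped entry changes the gap by (c_a + c_ac)(1 - y') in case (a) and by (c_a + c_ac) y' in case (b), where y' is the corresponding entry of the other feasible point; both quantities are nonnegative.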


B.  Main  Theorem 

Let Y* be the matrix corresponding to the true clusters in GSBM, i.e., $y^*_{ij} = 1$ if and only if $i, j \in V_1$
and they are in the same true cluster, and 0 otherwise. The theorem below establishes conditions under
which our algorithm, specifically the convex program (2)-(3), yields this Y* as the unique optimum
(without any further need for rounding etc.) with high probability (w.h.p.). Throughout the paper, with
high probability means with probability at least $1 - c_0 n^{-8}$ for some universal absolute constant $c_0$.


Theorem 1. Suppose the graph A is generated according to GSBM with p > q. If t in (4) is chosen to
satisfy

$$\tfrac{1}{4} p + \tfrac{3}{4} q \;\le\; t \;\le\; \tfrac{3}{4} p + \tfrac{1}{4} q, \qquad (5)$$

then Y* is the unique optimal solution to the convex program (2)-(3) w.h.p. provided

$$\frac{p - q}{\sqrt{p(1-q)}} \;\ge\; c_1 \max\left\{ \frac{\sqrt{n}}{K}, \frac{\log^2 n}{\sqrt{K}} \right\}, \qquad (6)$$

where $c_1$ is an absolute constant independent of p, q, K and n.

Our theorem quantifies the tradeoff between the four parameters governing the hardness of GSBM - the
minimum in-cluster edge density p, the maximum across-cluster edge density q, the minimum cluster size
K and the number of outliers $n_2 = n - n_1$ - required for our algorithm to succeed, i.e., to recover the
underlying true clustering without any error. Note that we can handle any values of p, q, $n_2$ and K as long
as they satisfy the condition in the theorem; in particular, they are allowed to scale with n. Interestingly,
the theorem does not have an explicit dependence on the number of clusters r except via the relation
$rK \le n$.
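To get a feel for the tradeoff, the sketch below computes the two sides of the condition for given parameters. It reads (5) as t lying between p/4 + 3q/4 and 3p/4 + q/4, and (6) as (p - q)/√(p(1 - q)) ≥ c_1 max{√n / K, log²n / √K}. Since the value of the absolute constant c_1 is not given in the text, the code returns only the quantities on each side, and the function names are hypothetical.

```python
import math

def theorem1_sides(p, q, K, n):
    """Return the left side of (6) and the max{...} on its right side.

    The constant c_1 is left out because its value is not specified;
    natural logarithms are used (the base only affects constants).
    """
    lhs = (p - q) / math.sqrt(p * (1.0 - q))
    rhs_scale = max(math.sqrt(n) / K, math.log(n) ** 2 / math.sqrt(K))
    return lhs, rhs_scale

def t_interval(p, q):
    """Range of t allowed by (5)."""
    return 0.25 * p + 0.75 * q, 0.75 * p + 0.25 * q

# Planted-clique style parameters: p = 1, q = 1/2.
lhs, rhs_scale = theorem1_sides(p=1.0, q=0.5, K=100, n=10_000)
lo, hi = t_interval(1.0, 0.5)
print(lhs, rhs_scale, (lo, hi))
```

For p = 1 and q = 1/2 the allowed interval for t is [0.625, 0.875], which contains the value t = 3/4 mentioned for the planted r-clique problem.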

We now discuss the tightness of Theorem 1 in terms of these parameters. When K = Θ(n), we have
a matching converse result.

Theorem  2.  Suppose  all  clusters  have  equal  size  K,  and  the  in-cluster  (cross-cluster,  resp.)  edge  prob¬ 
abilities  are  uniformly  p  (q,  resp.),  with  K  =  0(n)  and  n2  =  0(V/|).  Under  GSBM  with  p  >  q  and  n 
sufficiently large, for any algorithm to correctly recover the clusters with probability at least $\frac{3}{4}$, we must
have

$$\frac{p-q}{\sqrt{p(1-q)}} \ge c_2 \frac{1}{\sqrt{n}},$$

where $c_2$ is an absolute constant.

This theorem gives a necessary condition for any algorithm to succeed regardless of its computational
complexity. It shows that Theorem 1 is optimal up to logarithmic factors for all values of p and q when
$K = \Theta(n)$.

For smaller values of K, notice that Theorem 1 requires K to be $\Omega(\sqrt{n})$, since the left hand side of
(6) is less than 1. This lower bound is achieved when p and q are both constants independent of n and
K. There are reasons to believe that this requirement is unlikely to be improvable using polynomial-time
algorithms. Indeed, the special case with $p = 1$ and $q = \frac{1}{2}$ corresponds to the classical planted clique
problem [32]; finding a clique of size $K = o(\sqrt{n})$ is widely believed to be computationally hard even on
average and has been used as a hard problem for cryptographic applications [33, 34].

For other values of p and q, no general and rigorous converse result exists. However, there is evidence
suggesting that no other polynomial-time algorithm is likely to have better guarantees than our result (6).
The  authors  of  [13]  show,  using  non-rigorous  but  deep  arguments  from  statistical  physics,  that  recovering 
the  clustering  is  impossible  in  polynomial  time  if  £==  =  o  ■  Moreover,  the  work  in  [14]  shows  that 
a  large  class  of  spectral  algorithms  fail  under  the  same  condition.  In  view  of  these  results,  it  is  possible 
that  our  algorithm  is  optimal  w.r.t.  all  polynomial-time  algorithms  for  all  values  of  p,  q  and  K . 

Several  further  remarks  regarding  Theorem  1  are  in  order. 

•  A nice feature of our result is that we need $p - q$ to be large only as compared to $\sqrt{p}$; several
other existing results (see Table I) require a lower bound (as a function of n and K) on $p - q$ itself.
When K is $\Theta(n)$, we allow p and $p - q$ to be as small as $\Theta(\log^4(n)/n)$.

•  The number of clusters r is allowed to grow rapidly with n; this is called the high-dimensional
setting [21]. In particular, our algorithm can recover as many as $r = \Omega(\sqrt{n})$ equal size clusters. Any
algorithm with a better scaling would recover cliques of size $o(\sqrt{n})$, an unlikely task in polynomial
time in light of the hardness of the planted clique problem discussed above.

•  The number of outliers can also be large, as many as $n_2 = \Theta(n) = \Theta(n_1^2)$, which is attained when
$p - q$, $r$ are $\Theta(1)$ and $K = \Theta(\sqrt{n_2})$. In other words, almost all the nodes can be outliers, and this is
true even when there are multiple clusters that are not cliques (i.e., p < 1).

•  Not  all  existing  methods  can  handle  non-uniform  edge  probabilities  and  node  degrees,  which  often 
require  special  treatment  (see  [12]).  This  issue  is  addressed  seamlessly  by  our  method  by  definition 
of  GSBM. 
In the following sub-section, we discuss various planted problems to which Theorem 1 applies and
compare with existing work. Our results match the best existing results in all cases (up to logarithmic
factors), and in many important settings lead to order-wise stronger guarantees.


C.  Consequences  and  Comparison  with  Literature 

1)  Standard Stochastic Blockmodel (a.k.a. Planted Partition Model): This model assumes that all
clusters have the same size K with no isolated nodes ($n_2 = 0$), and the in-cluster and across-cluster edge
probabilities are uniformly p and q, respectively, with p > q. We compare our result to past approaches
and  theoretical  results  in  Table  I.  Our  result  has  the  scaling  p  —  q  =  Q  and  p  =  Q  which 

improves  on  all  existing  results  by  polynomial  factors.  This  means  that  we  can  handle  much  noisier 
and  sparser  graphs,  especially  when  the  number  of  clusters  r  =  n/K  is  growing.  A  recent  paper  [35], 
which  appeared  after  the  publication  of  the  conference  version  [36]  of  this  manuscript,  proposes  a  tensor 
approach  for  graph  clustering.  Our  guarantee  is  a  few  logarithmic  factors  better  than  their  results  applied 
to  the  standard  stochastic  blockmodel. 

2)  Planted r-Clique Problem: Here the task is to find a set of r disjoint cliques, each of size at least
K, that have been planted in an Erdős-Rényi random graph G(n, q). Setting p = 1 in Theorem 1, we
obtain the following guarantee for the planted r-clique problem.


Corollary 1. For the planted r-clique problem, the formulation (2)-(4) with t chosen according to
Theorem 1 finds the hidden cliques w.h.p. provided

$$1 - q \ge c_3 \max\left\{ \frac{n}{K^2}, \frac{\log^4 n}{K} \right\},$$

where $c_3$ is an absolute constant.


In  the  regime  where  r  is  allowed  to  scale  with  n  and  q  bounded  away  from  zero,  the  best  previous 
results  are  given  in  [5]  (1  —  q  —  and  in  [23]  (1  —  q  —  O(^)).  Corollary  1  is  stronger  than  both 

of  them  for  large  r. 

3)  The Heterophily Case (p < q): Given a graph A generated from the semi-random GSBM with
intra/inter-cluster densities p < q, we can run our algorithm on the graph $A' = 11^\top - A$, where $11^\top$ is the
all-one matrix. Note that A' can be considered as generated from GSBM with intra/inter-cluster densities
$p' = 1 - p$ and $q' = 1 - q$, where $p' > q'$. With this reduction, Lemma 1 and Theorem 1 immediately
yield the following guarantee.


Corollary 2. Under the semi-random GSBM with p < q, the formulation (2)-(4) applied to $11^\top - A$ with
t obeying

$$\frac{3}{4}p + \frac{1}{4}q \le 1 - t \le \frac{1}{4}p + \frac{3}{4}q$$

finds the true clustering w.h.p. provided

$$q - p \ge c_3 \sqrt{(1-p)q}\, \max\left\{ \frac{\sqrt{n}}{K}, \frac{\log^2 n}{\sqrt{K}} \right\},$$

where $c_3$ is an absolute constant.
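The reduction behind Corollary 2 amounts to complementing the adjacency matrix and the two densities. A minimal sketch follows; the helper names `complement_graph` and `complement_densities` are hypothetical and not part of the paper.

```python
import numpy as np

def complement_graph(A):
    """Return A' = 11^T - A for a 0/1 adjacency matrix A."""
    A = np.asarray(A)
    # Every 1 becomes 0 and every 0 becomes 1, including the diagonal.
    return np.ones_like(A) - A

def complement_densities(p, q):
    """Map heterophily densities (p < q) to (p', q') = (1 - p, 1 - q)."""
    return 1.0 - p, 1.0 - q

A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
A_prime = complement_graph(A)
p_prime, q_prime = complement_densities(0.1, 0.6)   # p' > q' after the flip
```

Because the whole matrix is flipped, the diagonal of `A_prime` is zero; how the diagonal convention is restored before running the convex program is left open here.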


This corollary immediately yields guarantees for the planted coloring problem [7] and the planted
r-cut [9] (a.k.a. planted noisy coloring [13]) problem. We are not aware of any existing work that explicitly
considers  the  mirrored  GSBM  in  its  general  form  (n2  >  0,  1>  q  >p>  0,  and  K  =  0(n)  with  potential 
non-random  edges).  However,  since  any  guarantee  for  GSBM  with  p  >  q  implies  a  guarantee  for  GSBM 
with  p  <  q,  Table  I  provides  a  comparison  with  existing  work  when  n2  =  0  and  the  edge  probabilities 
and  cluster  sizes  are  uniform.  Again  our  guarantee  outperforms  all  existing  ones. 
4)  Planted Coloring Problems: A special case of the above problem is the planted coloring problem,
where  p  =  0  and  n2  =  0.  The  best  existing  result  q  =  Q  (-^  V  is  given  by  various  algorithms 

(e.g.,  [7,  8]).  By  Corollary  2,  our  algorithm  succeeds  provided  q  —  Q  V  We  match  the  best 

existing algorithms for $K = O(n/\log^4(n))$, and are off by a few log factors for larger K.

D.  Estimating  t  in  Special  Cases 

We  have  argued  that  specifying  t  in  a  completely  data-driven  way  is  ill-posed  for  the  general  GSBM, 
e.g.,  when  the  clusters  are  hierarchical.  However,  for  special  cases  this  can  be  done  reliably  with  strong 
guarantees.  Consider  the  standard  stochastic  blockmodel,  where  all  clusters  have  the  same  size  K,  the 
edge  probabilities  are  uniform  (i.e.,  equal  to  p  within  clusters  and  q  across  clusters,  with  p  >  q ),  and  there 
are no outliers ($n_2 = 0$) or non-random edges. Observe that $E[A] - (1-p)I$ is a matrix with blocks of
p's and q's (see footnote 6), which is equal to the Kronecker product of a K × K all-one matrix and an r × r circulant
matrix with entries equal to p on the diagonal and q elsewhere. The all-one matrix has one non-zero
eigenvalue K, and the circulant matrix has eigenvalues $(p - q) + rq$ and $p - q$ with multiplicities 1 and
$r - 1$, respectively. The eigenvalues of $E[A] - (1-p)I$ are the pairwise products of the eigenvalues of these two matrices. It follows
that the first eigenvalue of $E[A]$ is $K(p - q) + nq + (1 - p)$ with multiplicity 1, the second eigenvalue is
$K(p - q) + (1 - p)$ with multiplicity $r - 1$, and the third eigenvalue is $1 - p$ with multiplicity $n - r$ [18].
This motivates us to use the eigenvalues of the observed matrix A to estimate p, q and t; see Algorithm 2.
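The eigenvalue structure described above can be checked numerically. The sketch below builds $E[A]$ for a small, arbitrarily chosen instance (the values of `r`, `K`, `p`, `q` and the helper name `expected_adjacency` are assumptions made for illustration) and computes its spectrum.

```python
import numpy as np

def expected_adjacency(r, K, p, q):
    """E[A] for r clusters of size K, using the convention a_ii = 1."""
    block = np.full((r, r), q) + (p - q) * np.eye(r)   # p on diagonal, q elsewhere
    EA = np.kron(block, np.ones((K, K)))               # blocks of p's and q's
    return EA + (1.0 - p) * np.eye(r * K)              # restore the unit diagonal

r, K, p, q = 4, 5, 0.7, 0.2
n = r * K
eig = np.sort(np.linalg.eigvalsh(expected_adjacency(r, K, p, q)))[::-1]

first = K * (p - q) + n * q + (1 - p)    # multiplicity 1
second = K * (p - q) + (1 - p)           # multiplicity r - 1
third = 1 - p                            # multiplicity n - r
```

Sorting the spectrum in decreasing order groups the three distinct eigenvalues with multiplicities 1, r - 1 and n - r.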


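Algorithm 2 Estimation of t from data

1) Compute and sort the eigenvalues of A, denoted as $\hat\lambda_1 \ge \hat\lambda_2 \ge \ldots \ge \hat\lambda_n$.

2) Let $\hat r = \arg\max_{i=2,\ldots,n-1} (\hat\lambda_i - \hat\lambda_{i+1})$. Set $\hat K = n/\hat r$.

3) Set

$$\hat p = \frac{\hat K \hat\lambda_1 + (n - \hat K)\hat\lambda_2 - n}{n(\hat K - 1)}, \qquad \hat q = \frac{\hat\lambda_1 - \hat\lambda_2}{n}.$$

4) Set $\hat t = \frac{\hat p + \hat q}{2}$.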
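The steps of Algorithm 2 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the expression used for the estimate of q is the one obtained by inverting the two leading eigenvalue formulas for $E[A]$, and `estimate_t` and `sample_sbm` are hypothetical helper names.

```python
import numpy as np

def estimate_t(A):
    """Estimate (r, K, p, q, t) from a symmetric 0/1 matrix A with unit diagonal."""
    n = A.shape[0]
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]   # lam[0] >= lam[1] >= ...
    gaps = lam[1:-1] - lam[2:]                   # gaps for 1-based i = 2..n-1
    r_hat = int(np.argmax(gaps)) + 2             # shift back to the 1-based index
    K_hat = n / r_hat
    p_hat = (K_hat * lam[0] + (n - K_hat) * lam[1] - n) / (n * (K_hat - 1))
    q_hat = (lam[0] - lam[1]) / n                # assumed form, see lead-in
    return r_hat, K_hat, p_hat, q_hat, (p_hat + q_hat) / 2

def sample_sbm(r, K, p, q, rng):
    """Sample a standard stochastic blockmodel graph (hypothetical helper)."""
    labels = np.repeat(np.arange(r), K)
    P = np.where(labels[:, None] == labels[None, :], p, q)
    upper = np.triu(rng.random((r * K, r * K)) < P, k=1).astype(float)
    return upper + upper.T + np.eye(r * K)       # convention a_ii = 1

rng = np.random.default_rng(0)
r_hat, K_hat, p_hat, q_hat, t_hat = estimate_t(sample_sbm(3, 100, 0.7, 0.1, rng))
```

In this well-separated example (three clusters of 100 nodes, p = 0.7, q = 0.1) the largest eigenvalue gap sits after the third eigenvalue, so the cluster count is identified and the density estimates land close to the true values.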


The  following  theorem  guarantees  that  the  estimation  errors  are  small. 

Theorem 3. Under the standard stochastic blockmodel and the condition (6) in Theorem 1, the parameters
estimated in Algorithm 2 satisfy the following with high probability, where $c_4$ is an absolute positive
constant:

$$\hat K = K,$$

$$\max\{|\hat p - p|, |\hat q - q|\} \le c_4 \frac{\sqrt{p(1-q)n}}{K},$$

$$\frac{1}{4}p + \frac{3}{4}q \le \hat t \le \frac{3}{4}p + \frac{1}{4}q.$$


In  particular,  the  estimated  t  satisfies  the  condition  (5)  in  Theorem  1.  The  above  theorem  also  ensures 
that  Algorithm  2  is  a  consistent  estimator  of  the  parameters  p  and  q  when  condition  (6)  is  satisfied,  a 
result  of  independent  interest.  Combining  Theorem  1  and  Theorem  3,  we  obtain  a  complete  algorithm 
that  is  guaranteed  to  find  the  clusters  under  the  standard  stochastic  blockmodel  obeying  condition  (6), 
without  knowledge  of  any  generative  parameters  of  the  underlying  model. 


6 Recall that we use the convention $a_{ii} = 1$.
V.  Empirical Results

A.  Implementation  Issues 

The  convex  program  (2)-(3)  can  be  solved  using  a  general  purpose  SDP  solver,  but  this  method  does 
not  scale  well  to  problems  with  more  than  a  few  hundred  nodes.  To  facilitate  fast  and  efficient  solution,  we 
propose  to  use  a  family  of  first-order  algorithms  called  Augmented  Lagrange  Multiplier  (ALM)  methods. 
Note that the program (2)-(3) can be rewritten as

$$\min_{Y, S \in \mathbb{R}^{n \times n}} \; \lambda \|C \circ S\|_1 + \|Y\|_* \qquad (7)$$

$$\text{s.t.} \quad Y + S = A, \qquad 0 \le y_{ij} \le 1,$$

where $\lambda := \frac{1}{48\sqrt{n}}$, the matrix $C \in \mathbb{R}^{n \times n}$ satisfies $c_{ij} = c_A$ if $a_{ij} = 1$ and $c_{ij} = c_{A^c}$ otherwise, and
$\circ$ denotes the element-wise matrix product. This problem can be recognized as a generalization of the
standard convex formulation of the low-rank and sparse matrix decomposition problem [25, 24], whose
numerical solution has been well studied. We adapt the ALM method in [37] to the above problem;
the result is given in Algorithm 3. Here $S_{\epsilon C}(\cdot) : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ is the element-wise weighted soft-thresholding operator,


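Algorithm 3 ALM for Minimizing Nuclear Norm plus Weighted $\ell_1$ Norm
Input: $A, C \in \mathbb{R}^{n \times n}$.

Initialize: $M^{(0)} = 0$; $Y^{(0)} = 0$; $S^{(0)} = 0$; $\mu_0 > 0$; $\alpha > 1$; $k = 0$; $\lambda = \frac{1}{48\sqrt{n}}$.

while not converged do

    $(U, \Sigma, V) = \mathrm{svd}(A - S^{(k)} + \mu_k^{-1} M^{(k)})$.
    $Y^{(k+1)} = U S_{\mu_k^{-1}}(\Sigma) V^\top$.

    For all $(i,j)$, $y_{ij}^{(k+1)} = \max\{\min\{y_{ij}^{(k+1)}, 1\}, 0\}$.

    $S^{(k+1)} = S_{\lambda \mu_k^{-1} C}(A - Y^{(k+1)} + \mu_k^{-1} M^{(k)})$.

    $M^{(k+1)} = M^{(k)} + \mu_k (A - Y^{(k+1)} - S^{(k+1)})$.

    $\mu_{k+1} = \alpha \mu_k$, $k = k + 1$.

end while

Return $Y^{(k+1)}$, $S^{(k+1)}$.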


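defined as

$$(S_{\epsilon C}(X))_{ij} = \begin{cases} X_{ij} - \epsilon c_{ij}, & \text{if } X_{ij} > \epsilon c_{ij}, \\ X_{ij} + \epsilon c_{ij}, & \text{if } X_{ij} < -\epsilon c_{ij}, \\ 0, & \text{otherwise.} \end{cases}$$

In other words, it shrinks each entry of X towards zero by $\epsilon c_{ij}$. The unweighted version $S_\epsilon(\cdot) = S_{\epsilon I}(\cdot)$
is also used. The stopping criteria and parameters of the algorithm are chosen similarly to [37].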
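The iteration in Algorithm 3 can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' code: the function names, the default values of `mu0`, `alpha`, `tol` and `max_iter`, and the relative-residual stopping rule are all assumptions, and the toy call uses a larger `lam` than the theoretical choice so that the example is meaningful at a very small n.

```python
import numpy as np

def soft_threshold(X, tau):
    """Shrink each entry of X towards zero by tau (scalar or same-shape array)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def alm_cluster(A, C, lam, mu0=1.0, alpha=1.5, tol=1e-7, max_iter=200):
    """Sketch of the ALM iteration for min lam*||C o S||_1 + ||Y||_* s.t. Y + S = A."""
    n = A.shape[0]
    Y = np.zeros((n, n))
    S = np.zeros((n, n))
    M = np.zeros((n, n))                       # Lagrange multiplier
    mu = mu0
    for _ in range(max_iter):
        # Singular value thresholding step for Y.
        U, sig, Vt = np.linalg.svd(A - S + M / mu, full_matrices=False)
        Y = (U * soft_threshold(sig, 1.0 / mu)) @ Vt
        Y = np.clip(Y, 0.0, 1.0)               # enforce 0 <= y_ij <= 1
        # Weighted entry-wise shrinkage step for S.
        S = soft_threshold(A - Y + M / mu, lam * C / mu)
        R = A - Y - S                          # constraint residual
        M = M + mu * R
        mu *= alpha
        if np.linalg.norm(R) <= tol * max(np.linalg.norm(A), 1.0):
            break
    return Y, S

labels = np.repeat(np.arange(2), 10)
A = (labels[:, None] == labels[None, :]).astype(float)   # two clean 10-node clusters
C = np.ones_like(A)                                       # equal weights (assumption)
Y, S = alm_cluster(A, C, lam=1.0 / np.sqrt(A.shape[0]))
```

Because the multiplier stays bounded entry-wise by `lam * C` while `mu` grows geometrically, the residual `A - Y - S` is driven towards zero; the clipping step keeps `Y` inside the box constraint at every iteration.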

B.  Simulations 

We perform experiments on synthetic data, and compare with other methods. We generate a graph using
the stochastic blockmodel with n = 1000 nodes, r = 5 clusters with equal size K = 200, and $p, q \in [0, 1]$.
We apply our method to the graph, where we pick t using Algorithm 2 and solve the optimization problem
using Algorithm 3. Due to limited numerical accuracy, the output $\hat Y$ of our algorithm may not be strictly integer,
so we do the following simple rounding: compute the mean $\bar y$ of the entries of $\hat Y$, and round each entry
of $\hat Y$ to 1 if it is greater than $\bar y$, and 0 otherwise. We measure the error by $\|Y^* - \mathrm{round}(\hat Y)\|_1$, which
equals the number of misclassified pairs. We say our method succeeds if it misclassifies fewer than 0.1%
of the pairs.
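The rounding step and the error measure described above can be sketched as follows. The function names and the small example matrix are hypothetical, chosen only to illustrate the procedure.

```python
import numpy as np

def round_to_clusters(Y):
    """Round entries above the mean of Y to 1 and all others to 0."""
    return (Y > Y.mean()).astype(int)

def pair_error(Y_true, Y_rounded):
    """Entry-wise l1 distance, i.e. the number of disagreeing entries."""
    return int(np.abs(Y_true - Y_rounded).sum())

def succeeds(Y_true, Y_rounded, frac=0.001):
    """Success criterion: fewer than 0.1% of the entries are misclassified."""
    return pair_error(Y_true, Y_rounded) < frac * Y_true.size

Y_true = np.kron(np.eye(2, dtype=int), np.ones((2, 2), dtype=int))
Y_hat = np.array([[0.98, 0.90, 0.05, 0.00],
                  [0.90, 1.00, 0.10, 0.02],
                  [0.05, 0.10, 0.97, 0.95],
                  [0.00, 0.02, 0.95, 1.00]])
Y_rounded = round_to_clusters(Y_hat)
```

Thresholding at the mean needs no tuning parameter and separates the two groups of entries whenever the solver output is close to a 0/1 matrix.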


Fig. 1. (a) Comparison of our method with Single-Linkage clustering (SLINK), spectral clustering, and the low-rank-plus-sparse (L+S) approach.
The area above each curve is the set of values of (p, q) for which a method successfully recovers the underlying true clusters. (b) More detailed
results for the area in the box in (a). The experiments are conducted on synthetic data with n = 1000 nodes and r = 5 clusters with equal
size K = 200.


For comparison, we consider three alternative methods: (1) Single-Linkage clustering (SLINK) [38],
which is a hierarchical clustering method that merges the most similar clusters in each iteration. We use
the difference of neighbors, namely $\|A_{i\cdot} - A_{j\cdot}\|_1$, as the distance measure of nodes i and j, and terminate
when SLINK finds a clustering with r = 5 clusters. (2) A spectral clustering method [39], where we run
SLINK on the top r = 5 singular vectors of A. (3) The low-rank-plus-sparse approach [28, 22], followed
by the rounding scheme described in the last paragraph. Note that the first two methods assume knowledge
of the number of clusters r, which is not available to our method.

For each q, we find the smallest p for which a method succeeds, and average over 20 trials. The results
are shown in Figure 1(a), where the area above each curve corresponds to the range of feasible (p, q) for
each method. It can be seen that our method subsumes all others, in that we succeed for a strictly larger
range of (p, q). Figure 1(b) shows more detailed results for sparse graphs ($p \le 0.3$, $q \le 0.1$), for which
SLINK and the low-rank-plus-sparse approach completely fail, while our method significantly outperforms
the spectral method, the only alternative method that works in this regime.


VI.  Proof  of  Lemma  1 

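In this section we prove the monotone lemma. Set $\lambda = \frac{1}{48\sqrt{n}}$. Define $\Omega_+ = \{(i,j) : a_{ij} = 0, \tilde a_{ij} = 1\}$
and $\Omega_- = \{(i,j) : a_{ij} = 1, \tilde a_{ij} = 0\}$. Let $Y \ne \hat Y$ be an arbitrary alternative feasible solution obeying
$0 \le y_{ij} \le 1$, $\forall i, j$. By optimality of $\hat Y$ to the original program, we have

$$c_A \sum_{a_{ij}=1} \hat y_{ij} - c_{A^c} \sum_{a_{ij}=0} \hat y_{ij} - \frac{1}{\lambda}\|\hat Y\|_* > c_A \sum_{a_{ij}=1} y_{ij} - c_{A^c} \sum_{a_{ij}=0} y_{ij} - \frac{1}{\lambda}\|Y\|_*.$$

Next, by definition of $\tilde A$, $\Omega_+$ and $\Omega_-$ (recall that $\hat y_{ij} = 1$ on $\Omega_+$ and $\hat y_{ij} = 0$ on $\Omega_-$), we have

$$c_A \sum_{a_{ij}=1} \hat y_{ij} - c_{A^c} \sum_{a_{ij}=0} \hat y_{ij} = c_A \sum_{\tilde a_{ij}=1} \hat y_{ij} - c_{A^c} \sum_{\tilde a_{ij}=0} \hat y_{ij} - \sum_{(i,j)\in\Omega_+} (c_A + c_{A^c}),$$

and

$$\begin{aligned}
c_A \sum_{a_{ij}=1} y_{ij} - c_{A^c} \sum_{a_{ij}=0} y_{ij}
&= c_A \sum_{\tilde a_{ij}=1} y_{ij} - c_{A^c} \sum_{\tilde a_{ij}=0} y_{ij} + (c_A + c_{A^c}) \sum_{(i,j)\in\Omega_-} y_{ij} - (c_A + c_{A^c}) \sum_{(i,j)\in\Omega_+} y_{ij} \\
&\ge c_A \sum_{\tilde a_{ij}=1} y_{ij} - c_{A^c} \sum_{\tilde a_{ij}=0} y_{ij} - \sum_{(i,j)\in\Omega_+} (c_A + c_{A^c}),
\end{aligned}$$

where we use $0 \le y_{ij} \le 1$ for all $(i,j)$ in the last inequality. Combining the last
three display equations establishes that

$$c_A \sum_{\tilde a_{ij}=1} \hat y_{ij} - c_{A^c} \sum_{\tilde a_{ij}=0} \hat y_{ij} - \frac{1}{\lambda}\|\hat Y\|_* > c_A \sum_{\tilde a_{ij}=1} y_{ij} - c_{A^c} \sum_{\tilde a_{ij}=0} y_{ij} - \frac{1}{\lambda}\|Y\|_*.$$

Since Y is arbitrary, we conclude that $\hat Y$ is the unique optimum of the modified program.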

VII.  Proof  of  Theorem  1 

We  prove  our  main  theorem  in  this  section.  The  proof  consists  of  three  main  steps,  which  we  elaborate 
below. 


A.  Step  1:  Reduction  to  Homogeneous  Edge  Probabilities 

We show that it suffices to assume that the in-cluster edge probability is uniformly p, and the
across-cluster edge probability is uniformly q. In the heterogeneous model, suppose an edge is placed between
nodes i and j with probability $p_{ij}$ if they are in the same cluster, where $p_{ij} \ge p$. This is equivalent to
the following two-step model: first flip a coin with head probability p, and add the edge if it is head;
if it is tail, then flip another coin and add the edge with probability $\frac{p_{ij} - p}{1 - p}$, so that the overall edge
probability is $p + (1-p)\frac{p_{ij}-p}{1-p} = p_{ij}$. By the monotone property in
Lemma 1, we know that if our convex program succeeds on the graph generated in the first step, then it
also succeeds for the second step, because more in-cluster edges are added. A similar argument applies to
the across-cluster edges. Therefore, heterogeneous edge probabilities only make the probability of success
higher, and thus we only need to prove the homogeneous case.

B.  Step  2:  Optimality  Condition 

We need some additional notation. We denote the singular value decomposition of Y* (notice Y*
is symmetric and positive semidefinite) by $U_0 \Sigma_0 U_0^\top$. For any matrix M, we define $P_T(M) := U_0U_0^\top M +
M U_0U_0^\top - U_0U_0^\top M U_0U_0^\top$. For a set $\Omega$ of matrix indices, let $P_\Omega(M)$ be the matrix obtained by setting
the entries of M outside $\Omega$ to zero, and we use $\sum_\Omega$ as a shorthand of $\sum_{(i,j)\in\Omega}$. Define the sets $A :=
\mathrm{support}(A)$ and $R := \mathrm{support}(Y^*) = \mathrm{support}(U_0U_0^\top)$. The true cluster matrix Y* is an optimal solution
to the program (2)-(3) if

$$\lambda c_A \sum_A (y^*_{ij} - y_{ij}) - \lambda c_{A^c} \sum_{A^c} (y^*_{ij} - y_{ij}) - (\|Y^*\|_* - \|Y\|_*) \ge 0 \qquad (8)$$

for all feasible Y obeying (3). Suppose there is a matrix W that satisfies

$$\|W\| \le 1, \quad P_T(W) = 0. \qquad (9)$$
The matrix $U_0U_0^\top + W$ is a subgradient of $f(X) = \|X\|_*$ at $X = Y^*$, so $\|Y\|_* - \|Y^*\|_* \ge \langle U_0U_0^\top +
W, Y - Y^* \rangle$. Therefore, (8) is implied by

$$\lambda c_A \sum_A (y^*_{ij} - y_{ij}) - \lambda c_{A^c} \sum_{A^c} (y^*_{ij} - y_{ij}) + \langle U_0U_0^\top + W, Y - Y^* \rangle \ge 0, \quad \forall\, 0 \le Y \le 1. \qquad (10)$$

The above inequality holds in particular for any feasible Y of the form $Y = Y^* - e_i e_j^\top$ with $(i,j) \in R$
or $Y = Y^* + e_i e_j^\top$ with $(i,j) \in R^c$. This leads to the following element-wise inequalities:

$$\begin{aligned}
-\lambda c_{A^c} - (U_0U_0^\top + W)_{ij} &\ge 0, && \forall (i,j) \in R \cap A^c, \\
-\lambda c_A + W_{ij} &\ge 0, && \forall (i,j) \in R^c \cap A, \\
\lambda c_A - (U_0U_0^\top + W)_{ij} &\ge 0, && \forall (i,j) \in R \cap A, \\
\lambda c_{A^c} + W_{ij} &\ge 0, && \forall (i,j) \in R^c \cap A^c.
\end{aligned} \qquad (11)$$

It is easy to see that these inequalities are actually equivalent to (10), so together with (9) they form a
sufficient condition for the optimality of Y*.

Finding  a  “dual  certificate”  W  obeying  the  exact  conditions  (9)  and  (11)  is  difficult,  and  does  not 
guarantee  uniqueness  of  the  optimum.  Instead,  we  consider  an  alternative  sufficient  condition  that  only 
requires  a  W  that  approximately  satisfies  the  exact  conditions.  This  is  done  in  Proposition  1  below  (proved 
in  Section  VII-D),  which  significantly  simplifies  the  construction  of  W .  Note  that  condition  (b)  in  the 
proposition  is  a  relaxation  of  the  equality  in  (9),  whereas  condition  (c)  tightens  (11).  Setting  e  =  0  and 
changing  equalities  to  inequalities  in  the  proposition  recover  the  exact  conditions. 


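Proposition 1. Y* is the unique optimal solution to the program (2)-(3), if there exists a matrix $W \in
\mathbb{R}^{n \times n}$ and a number $0 < \epsilon < 1$ that satisfy the following conditions: (a) $\|W\| \le 1$, (b) $\|P_T(W)\|_\infty \le
\frac{\epsilon}{2} \lambda \min\{c_{A^c}, c_A\}$, and (c)

$$\begin{aligned}
-(1+\epsilon)\lambda c_{A^c} - (U_0U_0^\top + W)_{ij} &= 0, && \forall (i,j) \in R \cap A^c, \\
-(1+\epsilon)\lambda c_A + W_{ij} &= 0, && \forall (i,j) \in R^c \cap A, \\
(1-\epsilon)\lambda c_A - (U_0U_0^\top + W)_{ij} &\ge 0, && \forall (i,j) \in R \cap A, \\
(1-\epsilon)\lambda c_{A^c} + W_{ij} &\ge 0, && \forall (i,j) \in R^c \cap A^c.
\end{aligned}$$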


C.  Step  3:  Constructing  W 

We build a W that satisfies the conditions in Proposition 1 w.h.p. We use 1 to denote the all-one column
vector in $\mathbb{R}^n$, so $11^\top$ is the all-one n × n matrix. Let $E = \{(i,i) : i = 1, \ldots, n\}$ be the set of diagonal
entries. For an $\epsilon$ to be specified later, we define $W = W_1 + W_2 + W_3 + W_4$ with $W_i$ given by

$$W_1 = -P_{R \cap A^c}(U_0U_0^\top) + \frac{1-p}{p} P_{R \cap A}(U_0U_0^\top),$$

$$W_2 = (1+\epsilon)\lambda c_{A^c} \left[ -P_{R \cap A^c}(11^\top) + \frac{1-p}{p} P_{R \cap A}(11^\top) \right],$$

$$W_3 = (1+\epsilon)\lambda c_A \left[ P_{(R^c \cap E^c) \cap A}(11^\top) - \frac{q}{1-q} P_{(R^c \cap E^c) \cap A^c}(11^\top) \right],$$

$$W_4 = (1+\epsilon)\lambda c_A P_{R^c}(I),$$

where I is the identity matrix. We briefly explain the ideas behind the construction. Each of the matrices
$W_1$, $W_2$ and $W_3$ is the sum of two terms. The first term is derived from the equalities in condition (c) in
Proposition 1. The second term is constructed in such a way that each $W_i$ is a zero-mean random matrix
(due to the randomness in A), so it is likely to have small norms and satisfy conditions (a) and (b). The
matrix $W_4$ accounts for the outlier nodes. In particular, it is a diagonal matrix with $(W_4)_{ii}$ being non-zero
if and only if $i \in V_2$.
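As a quick check of the zero-mean claim, the following worked step (added here as an illustration, assuming the homogeneous model of Step 1, in which an off-diagonal pair in R is an edge with probability p and an off-diagonal pair in $R^c$ is an edge with probability q) computes the expectation of the random coefficient multiplying each off-diagonal entry.

```latex
\begin{align*}
(i,j)\in R,\ i\neq j:\quad
  & \mathbb{E}\Big[-\mathbf{1}\{a_{ij}=0\} + \tfrac{1-p}{p}\,\mathbf{1}\{a_{ij}=1\}\Big]
    = -(1-p) + \tfrac{1-p}{p}\,p = 0, \\
(i,j)\in R^c\cap E^c:\quad
  & \mathbb{E}\Big[\mathbf{1}\{a_{ij}=1\} - \tfrac{q}{1-q}\,\mathbf{1}\{a_{ij}=0\}\Big]
    = q - \tfrac{q}{1-q}\,(1-q) = 0.
\end{align*}
```

The first line covers the off-diagonal entries of $W_1$ and $W_2$ (each scaled by a deterministic factor), and the second line covers the entries of $W_3$.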
The following proposition (proved in Section VII-E) shows that W indeed satisfies all the desired
conditions w.h.p., hence establishing Theorem 1.

Proposition 2. Under the conditions in Theorem 1, W constructed above satisfies the conditions (a)-(c)
in Proposition 1 w.h.p. with

$$\epsilon := \frac{48}{\sqrt{t(1-t)}} \max\left\{ \frac{\sqrt{n}}{K}, \sqrt{\frac{\log^4 n}{K}} \right\}.$$


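D.  Proof of Proposition 1 (Optimality Condition)

Let $P_{T^\perp}(W) := W - P_T(W)$. Consider any feasible solution Y and let $\Delta := Y - Y^*$, with entries $\delta_{ij}$. The difference
between the objective values of Y and Y* satisfies

$$\begin{aligned}
(*) &:= c_A \sum_A \delta_{ij} - c_{A^c} \sum_{A^c} \delta_{ij} - \frac{1}{\lambda}\|Y^* + \Delta\|_* + \frac{1}{\lambda}\|Y^*\|_* \\
&\le c_A \sum_A \delta_{ij} - c_{A^c} \sum_{A^c} \delta_{ij} - \frac{1}{\lambda} \langle U_0U_0^\top + P_{T^\perp}(W), \Delta \rangle \\
&= c_A \sum_A \delta_{ij} - c_{A^c} \sum_{A^c} \delta_{ij} - \frac{1}{\lambda} \langle U_0U_0^\top + W, \Delta \rangle + \frac{1}{\lambda} \langle P_T W, \Delta \rangle,
\end{aligned}$$

where in the inequality we use the fact that $U_0U_0^\top + P_{T^\perp}(W)$ is a subgradient of $\|Y\|_*$ at Y*, a consequence
of condition (a) in the proposition and $\|P_{T^\perp}(W)\| \le \|W\|$. We substitute condition (c) into the third term
in the R.H.S. of the last display equation to obtain

$$\begin{aligned}
(*) &\le \left[ \epsilon c_A \sum_{R \cap A} \delta_{ij} - \epsilon c_{A^c} \sum_{R^c \cap A^c} \delta_{ij} \right] + \left[ \epsilon c_{A^c} \sum_{R \cap A^c} \delta_{ij} - \epsilon c_A \sum_{R^c \cap A} \delta_{ij} \right] + \frac{1}{\lambda} \langle P_T W, \Delta \rangle \\
&\le -\epsilon \min\{c_A, c_{A^c}\} \|\Delta\|_1 + \frac{1}{\lambda} \langle P_T W, \Delta \rangle,
\end{aligned}$$

where we use the fact that $\delta_{ij} \le 0$ for $(i,j) \in R$ and $\delta_{ij} \ge 0$ for $(i,j) \in R^c$ since $Y = Y^* + \Delta$ satisfies (3).
Applying condition (b) yields

$$(*) \le -\epsilon \min\{c_{A^c}, c_A\} \|\Delta\|_1 + \frac{1}{\lambda} \|P_T W\|_\infty \|\Delta\|_1 \le -\frac{\epsilon}{2} \min\{c_{A^c}, c_A\} \|\Delta\|_1.$$

The last R.H.S. is strictly negative whenever $\Delta \ne 0$. This proves that Y* is the unique optimal solution.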


E.  Proof  of  Proposition  2 

We  show  that  W  constructed  in  Section  VII-C  satisfies  the  conditions  in  Proposition  1  w.h.p.  We  need 
two  technical  lemmas.  First  notice  that  the  conditions  (5)  and  (6)  in  Theorem  1  imply  bounds  on  various 
quantities. 

Lemma  2.  Under  conditions  (5)  and  (6)  in  Theorem  1,  we  have  (1)  p(l—q)  >  £(1— t)  >  c max  |  ,  logK  n 

and  (2)  e  < 

Proof. Since $1 \ge t \ge 0$, we have $t(1-t) \ge \frac{1}{2}\min\{t, 1-t\}$. Under condition (5) on t, we further have
$\min\{t, 1-t\} \ge \frac{1}{4}(p - q)$ and $p(1-q) \ge t(1-t)$. It then follows from condition (6) that

$$t(1-t) \ge \frac{1}{8}(p - q) \ge \frac{c_1}{8}\sqrt{t(1-t)} \max\left\{ \frac{\sqrt{n}}{K}, \frac{\log^2 n}{\sqrt{K}} \right\},$$

which implies the inequalities in part (1) of the lemma. Part (2) follows directly from part (1) and the
definition of $\epsilon$. □


Due to the randomness of A, $W_1$, $W_2$ and $W_3$ are symmetric random matrices with independent
zero-mean entries. The support and variance of their entries are bounded in the following lemma.


Lemma  3.  The  following  holds  under  the  conditions  in  Theorem  1. 

1)  For  i  =  1,2,3,  the  absolute  values  of  the  entries  ofWi  are  bounded  by  Bi  a.s.  and  their  variance  is 
bounded  by  of,  where 


Bl  pK ’ 

B2  =  -Ac.4C, 

V 

2 

B3  =  - - \cA, 

i -q 

2)  We  have  <7i  >  cBi n  for  i  —  1,2,  3. 


a,  = 


pK 2  ’ 

'  t)  ,  V  ■ 

err,  =  — - -A^c 


2  4(1  t)  22 


p 


Ac’ 


At 


(To  = 


\2„2 
-A  CA. 


Proof  The  first  part  of  the  lemma  follows  from  the  definitions  of  the  Wfs,  q  <  t  <  p  and  e  <  \ 
(Lemma  2).  The  second  part  follows  from  part  (1)  of  Lemma  2.  O 


We now proceed with the proof of Proposition 2. The proof has three sub-steps, corresponding to
checking the three conditions in Proposition 1.

(1)  Bounding  ||W||. 

Recall  that  W\  is  a  random  matrix  with  i.i.d.  entries,  and  their  absolute  values  and  variance  are  bounded 
in  Lemma  3.  We  apply  standard  bounds  on  the  spectral  norm  of  random  matrices  (Lemma  4  in  the 
Appendix)  to  obtain  w.h.p.  __ 


||,Vi||-4vp  - 


i 

4’ 


where  the  last  inequality  follows  from  p  >  cjfj  (cf.  Lemma  2).  In  a  similar  manner,  we  obtain  that  w.h.p. 


\\W2\\  <6-2 


Ac^c  •  ^fn  =  12 


1 

48 


(1  —  t)n 


y/n  < 


1 

4’ 


where  the  last  inequality  follows  from  p  >  t,  and  w.h.p. 


II W3||  <6-2 


- Ac.4  • 

-  q 


l  li-t 
48  V  tn 


1 

4’ 


where  the  last  inequality  follows  from  1  —  t  <  1  —  q.  Finally,  since  W4  —  (1  +  e)A cAPRc(I)  is  a  diagonal 
matrix,  we  have 

M<(  l  +  e)Ac,<2.lf4f<l 

since  t  >  c\  (cf.  Lemma  2).  We  conclude  that  ||VF||  <  IIWH  <  1. 

(2)  Bounding $\|P_T W\|_\infty$.

Define the sets $R_m := \{(i,j) : i, j \text{ in cluster } m\}$, and recall that r is the number of clusters and
$R := \mathrm{support}(Y^*) = \cup_{m=1}^r R_m$. We have $Y^* = \sum_{m=1}^r P_{R_m}(11^\top)$ and thus its singular vectors satisfy

$$U_0U_0^\top = \sum_{m=1}^r \frac{1}{k_m} P_{R_m}(11^\top),$$

where $k_m$ denotes the size of cluster m. Therefore, for i = 1, 2, 3, each entry of the matrix $U_0U_0^\top W_i$ equals $\frac{1}{k_m}$ times the sum of $k_m$ independent
zero-mean random variables (which are the entries of $W_i$), whose absolute values and variance are bounded
in Lemma 3. We may use the standard Bernstein inequality (Lemma 5 in the Appendix) to bound each
$\|U_0U_0^\top W_i\|_\infty$.

For  I'Ll ,  we  have 


\UoUJwA 


< 


< 


log2  n  I  1 

242  V  ~Kn' 


where  we  use  p  >  cjk  in  the  last  inequality  (c.f.  Lemma  2).  Similarly,  1L2  satisfies 


\U0UJW2\\<i- 


1 

00  ~K 


■  c3  J~~^cAo  y/ 1<  log  n 


— C31 


(1  —  t)  log  n  1 


< 


log2  n 


pK  48  y  (1  —  t)n  242  V  Kn 


where  we  use  p  >  t  and  log  n  being  sufficiently  large  in  the  last  inequality.  The  matrix  W3  obeys 


\u0uJw3\\  <— 

I  u  U  ^  lloo  — 


'  c3  J  YZTq  Xca  ^ K  log  n 


— C31 


t  log  n  1 
(1  —  q)K  48 


1  —  t  log2  n 
tn  ~  242  V  Kn 


where we use $1 - q \ge 1 - t$ and log n being sufficiently large in the last inequality. Finally, since $W_4$ is a
diagonal matrix supported on $R^c$ and $U_0U_0^\top$ is supported on R, we have $U_0U_0^\top W_4 = 0$.

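On the other hand, we have

$$\lambda c_A \epsilon \ge \frac{1}{48}\sqrt{\frac{1-t}{tn}} \cdot 48\sqrt{\frac{\log^4 n}{K t(1-t)}} = \frac{1}{t}\sqrt{\frac{\log^4 n}{Kn}}$$

and

$$\lambda c_{A^c} \epsilon \ge \frac{1}{48}\sqrt{\frac{t}{(1-t)n}} \cdot 48\sqrt{\frac{\log^4 n}{K t(1-t)}} = \frac{1}{1-t}\sqrt{\frac{\log^4 n}{Kn}},$$

so $\frac{1}{24}\epsilon\lambda \min\{c_A, c_{A^c}\} \ge \frac{\log^2 n}{24}\sqrt{\frac{1}{Kn}}$. Combining with the previous bounds on $\|U_0U_0^\top W_i\|_\infty$, we obtain

$$\|U_0U_0^\top W_i\|_\infty \le \frac{1}{24}\epsilon\lambda \min\{c_A, c_{A^c}\}.$$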

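Now observe that since W and $U_0U_0^\top$ are both symmetric, we have $W U_0U_0^\top = (U_0U_0^\top W)^\top$ and thus
$\|W U_0U_0^\top\|_\infty = \|U_0U_0^\top W\|_\infty$. Furthermore, we have

$$U_0U_0^\top W U_0U_0^\top = U_0U_0^\top W \sum_{m=1}^r \frac{1}{k_m} P_{R_m}(11^\top),$$

which implies $\|U_0U_0^\top W U_0U_0^\top\|_\infty \le \|U_0U_0^\top W\|_\infty$. It follows that

$$\begin{aligned}
\|P_T W\|_\infty &= \|U_0U_0^\top W + W U_0U_0^\top - U_0U_0^\top W U_0U_0^\top\|_\infty \\
&\le \|U_0U_0^\top W\|_\infty + \|W U_0U_0^\top\|_\infty + \|U_0U_0^\top W U_0U_0^\top\|_\infty \\
&\le 3\|U_0U_0^\top W\|_\infty \le 3 \sum_{i=1}^4 \|U_0U_0^\top W_i\|_\infty.
\end{aligned}$$

Using the bounds on $\|U_0U_0^\top W_i\|_\infty$ derived above, we obtain that $\|P_T W\|_\infty \le 12 \cdot \frac{1}{24}\epsilon\lambda \min\{c_A, c_{A^c}\} =
\frac{1}{2}\epsilon\lambda \min\{c_A, c_{A^c}\}$.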

(3)  The  two  equalities  in  condition  (c)  in  Proposition  1  hold  by  the  definition  of  W.  We  now  turn  to 


the  inequalities  in  condition  (c).  Because  1  —  q  >  1  —  t  and  p  <  4 1,  we  have  —  >  1 AU  It  follows  from 


the  conditions  in  Theorem  1  that 

p-q 


>c\/p(l  —  q)  max  <  y/n/ K,  y^log4  n/K 


>8p(l—t)  ■  —j  =  =  max  |  s/n/K,  \j\ogA  n/K |  =  8p(l—t)e. 


(12) 


We  thus  have 


P  ~  *  >  P  ~  (  \p  +  j  =  >  8p(!  -  t)e. 


One  verifies  that  this  implies  (1  +  e)y  <  (1  —  2 e)y  Plugging  in  the  values  of  cA  and  cAc 

in  (4)  yields 

(l  +  e)CA‘{1~p)  <(l-2e)cA, 


V 


Hence,  for  each  (i,j)  e  R  D  A,  we  have 


(UoUj  +  W)ij  —  -(UqIIJ  )ij  +  (1  +  e)Ac_4c — — ^ -  <  -(UqUq  )ij  +  (1  —  2e)cA. 


We  also  have 


n 


1  1  (*)  48 

p{UoU°  ^  —  pK  ~  K  V  f(l  —  t)  48  V  tn 


1  - 1 


<  e '  Ac*, 


where  (z)  follows  from  p  >t.  Combining  the  last  two  displays  proves  the  first  inequality  in  condition  (c). 
Similarly,  we  have 


73  —  a  (*0  (w) 

Q  =  —r~  -  8P(X  ~  l)e  -  2t(X  “  ?)e> 


where  (zz)  follows  from  (12)  and  (in)  follows  from  p  >  t  and  1  —  t  >  1  —  | p  —  \q  >  \(1  —  q).  This 
implies  (1  +  e)  \J~^ yy  <  (1  —  e)  yj  -^rr  Therefore,  for  each  (i,j)  e  Rc  D  Ac, 

caQ 


wi:j  =  -(1  +  e): 


>  -(1  -  e)cAc, 


13  ■  ~'l-q 

proving  the  second  inequality  in  condition  (c).  This  completes  the  proof  of  Proposition  2. 


VIII.  Proof  of  Theorem  2 

We use a standard information theoretic argument via Fano's inequality. For simplicity we assume $n_1/K$
and $n_2/K$ are both integers, and we use $c_1, c_2, \ldots$ to denote positive absolute constants. Let $\mathcal{F}$ be the set of
all possible ways of assigning n nodes into $n_1/K$ clusters of equal size K. When $K = \Theta(n_1) = \Theta(n_2)$,
the cardinality of $\mathcal{F}$ can be bounded as

$$M := |\mathcal{F}| = \frac{1}{(n_1/K)!} \binom{n}{K}\binom{n-K}{K} \cdots \binom{n_2+K}{K} \ge c_2 \cdot c_1^n$$

for some $c_1 > 1$ and $c_2 > 0$.
Suppose the true cluster matrix $Y^*$ is obtained uniformly at random from $\mathcal{F}$, and the graph $A$ is
generated from $Y^*$ according to GSBM with uniform edge probabilities. We use $\mathbb{P}_{A|Y^*}$ to denote the
distribution of $A$ given $Y^*$. Let $\hat{Y}$ be any measurable function of $A$. The standard Fano's inequality gives

\[
\sup_{Y^* \in \mathcal{F}} \mathbb{P}\left[ \hat{Y} \ne Y^* \mid Y^* \right]
\ge 1 - \frac{I(A;Y^*) + \log 2}{\log M}
\ge 1 - \frac{I(A;Y^*) + \log 2}{c_3 n}
\]


for $n$ sufficiently large. We now bound the mutual information $I(A;Y^*)$. Observe that

\[
\begin{aligned}
I(A;Y^*) &= H(A) - H(A|Y^*) \le \sum_{i<j} H(a_{ij}) - H(A|Y^*) \\
&= \binom{n}{2} H(a_{12}) - \binom{n}{2} H(a_{12}|Y^*) = \binom{n}{2} I(a_{12};Y^*),
\end{aligned}
\]

where in the second equality we have used the symmetry under the uniform distribution of $Y^*$ and the
conditional independence between the $a_{ij}$'s. By definition of the mutual information, we have

\[
I(a_{12};Y^*) = I(a_{12};y^*_{12}) = \mathbb{E}_{y^*_{12}}\left[ D\left( \mathbb{P}(a_{12}|y^*_{12}) \,\|\, \mathbb{P}(a_{12}) \right) \right].
\]

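We compute the divergence on the last RHS. Let $\alpha := \mathbb{P}(y^*_{12} = 1) = \frac{n_1(K-1)}{n(n-1)}$ and $\gamma := \mathbb{P}(a_{12} = 1) = \alpha p + (1-\alpha)q$. It follows that

\[
\begin{aligned}
&\mathbb{E}_{y^*_{12}}\left[ D\left( \mathbb{P}(a_{12}|y^*_{12}) \,\|\, \mathbb{P}(a_{12}) \right) \right] \\
&\quad= \sum_{y \in \{0,1\}} \sum_{a \in \{0,1\}} \mathbb{P}(y^*_{12} = y)\, \mathbb{P}(a_{12} = a \mid y^*_{12} = y) \log \frac{\mathbb{P}(a_{12} = a \mid y^*_{12} = y)}{\mathbb{P}(a_{12} = a)} \\
&\quad= \alpha p \log\frac{p}{\gamma} + \alpha(1-p) \log\frac{1-p}{1-\gamma} + (1-\alpha) q \log\frac{q}{\gamma} + (1-\alpha)(1-q) \log\frac{1-q}{1-\gamma} \\
&\quad\le \alpha p \left( \frac{p}{\gamma} - 1 \right) + \alpha(1-p) \left( \frac{1-p}{1-\gamma} - 1 \right) + (1-\alpha) q \left( \frac{q}{\gamma} - 1 \right) + (1-\alpha)(1-q) \left( \frac{1-q}{1-\gamma} - 1 \right) \\
&\quad= \frac{\alpha(1-\alpha)(p-q)^2}{\gamma(1-\gamma)} \le c_4 \frac{(p-q)^2}{p(1-q)},
\end{aligned}
\]

where the first inequality uses $\log x \le x - 1$, and in the last inequality we use $\gamma \ge \alpha p$, $1 - \gamma \ge (1-\alpha)(1-q)$ and $\alpha, 1 - \alpha = \Theta(1)$. Combining
pieces, we obtain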


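\[
\sup_{Y^* \in \mathcal{F}} \mathbb{P}\left[ \hat{Y} \ne Y^* \mid Y^* \right]
\ge 1 - \frac{c_5 \frac{(p-q)^2 n^2}{p(1-q)} + \log 2}{c_3 n}.
\]

For the last R.H.S. to fall below the level required by the theorem, we need $\frac{(p-q)^2}{p(1-q)} \ge c_6 \frac{1}{n}$. This completes the proof of the theorem.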
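The two analytic steps of this proof, the bound on the per-pair mutual information and the resulting Fano bound, can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, `log_M` must be supplied by the caller, and the constant $c_4$ is taken to be 1, which is what $\gamma \ge \alpha p$ and $1 - \gamma \ge (1-\alpha)(1-q)$ give directly.

```python
from math import log


def pair_information(alpha: float, p: float, q: float) -> float:
    """Exact I(a_12; y*_12) in nats, where P(y=1)=alpha, P(a=1|y=1)=p,
    P(a=1|y=0)=q. Requires 0 < alpha < 1 and 0 < p, q < 1."""
    gamma = alpha * p + (1 - alpha) * q
    return (alpha * p * log(p / gamma)
            + alpha * (1 - p) * log((1 - p) / (1 - gamma))
            + (1 - alpha) * q * log(q / gamma)
            + (1 - alpha) * (1 - q) * log((1 - q) / (1 - gamma)))


def divergence_upper_bound(alpha: float, p: float, q: float) -> float:
    """alpha(1-alpha)(p-q)^2 / (gamma(1-gamma)), obtained via log x <= x - 1."""
    gamma = alpha * p + (1 - alpha) * q
    return alpha * (1 - alpha) * (p - q) ** 2 / (gamma * (1 - gamma))


def fano_error_lower_bound(n: int, alpha: float, p: float, q: float,
                           log_M: float) -> float:
    """1 - (C(n,2) * I(a_12; y*_12) + log 2) / log M, clipped at 0."""
    total_info = n * (n - 1) / 2 * pair_information(alpha, p, q)
    return max(0.0, 1 - (total_info + log(2)) / log_M)


if __name__ == "__main__":
    alpha, p, q = 0.25, 0.30, 0.28
    print(pair_information(alpha, p, q))
    print(divergence_upper_bound(alpha, p, q))
    print((p - q) ** 2 / (p * (1 - q)))   # final bound with c_4 = 1
```

Each printed value should be no larger than the one after it, mirroring the chain of inequalities in the proof.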


IX.  Proof  of  Theorem  3 

Let $\lambda_i$ be the $i$-th eigenvalue of the matrix $\mathbb{E}[A]$ (counting multiplicity). Observe that the matrix $\bar{A} := A - \mathbb{E}A$
is a random symmetric matrix with independent zero-mean entries, each of which is bounded in
absolute value by 1 and has variance bounded by $p(1-p) \vee q(1-q) \le p(1-q)$. Under the condition of
Theorem 3, we may apply Lemma 4 to obtain $\|\bar{A}\| \le 4\sqrt{p(1-q)n}$ w.h.p. It then follows from Weyl's
inequality [40] that w.h.p.

\[
\max_i |\hat{\lambda}_i - \lambda_i| \le \|A - \mathbb{E}A\| = \|\bar{A}\| \le 4\sqrt{p(1-q)n}. \tag{13}
\]


In  the  sequel,  we  assume  we  are  on  the  event  that  (13)  holds. 
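a) Estimation of $r$: Recall that $\lambda_1 = K(p-q) + nq + (1-p)$, $\lambda_2 = \cdots = \lambda_r = K(p-q) + (1-p)$,
and $\lambda_{r+1} = \cdots = \lambda_n = 1 - p$. The inequality (13) implies that for some universal constant $c_1$,

• $\hat{\lambda}_1 - \hat{\lambda}_2 \le \lambda_1 - \lambda_2 + |\hat{\lambda}_1 - \lambda_1| + |\hat{\lambda}_2 - \lambda_2| \le nq + c_1\sqrt{p(1-q)n}$;

• $\hat{\lambda}_i - \hat{\lambda}_{i+1} \le \lambda_i - \lambda_{i+1} + |\hat{\lambda}_i - \lambda_i| + |\hat{\lambda}_{i+1} - \lambda_{i+1}| \le c_1\sqrt{p(1-q)n}$ for $i = 2, \ldots, r-1$ and $i \ge r+1$;

• $\hat{\lambda}_r - \hat{\lambda}_{r+1} \ge \lambda_r - \lambda_{r+1} - |\hat{\lambda}_r - \lambda_r| - |\hat{\lambda}_{r+1} - \lambda_{r+1}| \ge K(p-q) - c_1\sqrt{p(1-q)n}$.

Under the condition of Theorem 3, we have $K(p-q) \ge c_2\sqrt{p(1-q)n}$ for some constant $c_2$. This implies

\[
\hat{\lambda}_r - \hat{\lambda}_{r+1} > \frac{K(p-q)}{2} > \hat{\lambda}_i - \hat{\lambda}_{i+1}
\]

for all $i > 1$ and $i \ne r$. This guarantees $\hat{r} = r$ and thus $\hat{K} = K$.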
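The eigengap argument in a) can be mirrored in a few lines. This is a minimal sketch, assuming that $\hat{r}$ is chosen as the index $i \ge 2$ with the largest gap $\hat{\lambda}_i - \hat{\lambda}_{i+1}$; that rule is the natural reading of the argument, but the estimator itself is defined outside this section. Both function names are hypothetical.

```python
def population_spectrum(n: int, K: int, p: float, q: float) -> list:
    """Eigenvalues lambda_1 >= ... >= lambda_n recalled in a), for r = n/K
    clusters of size K."""
    r = n // K
    return ([K * (p - q) + n * q + (1 - p)]
            + [K * (p - q) + (1 - p)] * (r - 1)
            + [1 - p] * (n - r))


def estimate_r(eigs) -> int:
    """Return the 1-based index i >= 2 maximizing lam_i - lam_{i+1},
    after sorting the eigenvalues in decreasing order."""
    lam = sorted(eigs, reverse=True)
    if len(lam) < 3:
        raise ValueError("need at least three eigenvalues")
    # lam[i - 1] is lambda_i in the 1-based notation of the text
    gaps = {i: lam[i - 1] - lam[i] for i in range(2, len(lam))}
    return max(gaps, key=gaps.get)


if __name__ == "__main__":
    spec = population_spectrum(n=12, K=4, p=0.7, q=0.2)
    # a bounded perturbation, standing in for the deviation allowed by (13)
    noisy = [x + 0.1 * (-1) ** k for k, x in enumerate(spec)]
    print(estimate_r(spec), estimate_r(noisy))
```

The gap at $i = 1$ is excluded on purpose: it is of order $nq$ and may exceed the gap at $i = r$, which is why the argument is restricted to $i > 1$.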


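Estimation of $p$ and $q$: By (13), the estimation error of $\hat{q}$ satisfies

\[
|\hat{q} - q| = \left| \frac{\hat{\lambda}_1 - \lambda_1}{n} - \frac{\hat{\lambda}_2 - \lambda_2}{n} \right|
\le \frac{|\hat{\lambda}_1 - \lambda_1| + |\hat{\lambda}_2 - \lambda_2|}{n}
\le c_3 \frac{\sqrt{p(1-q)n}}{K}.
\]

Similarly, we have

\[
\begin{aligned}
|\hat{p} - p| &= \left| \frac{K\hat{\lambda}_1 + (n-K)\hat{\lambda}_2 - n}{n(K-1)} - \frac{K\lambda_1 + (n-K)\lambda_2 - n}{n(K-1)} \right| \\
&= \left| \frac{K(\hat{\lambda}_1 - \lambda_1) + (n-K)(\hat{\lambda}_2 - \lambda_2)}{n(K-1)} \right|
\le c_3 \frac{\sqrt{p(1-q)n}}{K}.
\end{aligned}
\]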
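As a sanity check on the two error expressions, one can verify that the same formulas, evaluated at the population eigenvalues recalled in a), return $q$ and $p$ exactly. The computation below assumes the estimators are $\hat{q} = (\hat{\lambda}_1 - \hat{\lambda}_2)/n$ and $\hat{p} = (\hat{K}\hat{\lambda}_1 + (n-\hat{K})\hat{\lambda}_2 - n)/(n(\hat{K}-1))$, as read off from the displays; their formal definition lies outside this section.

```latex
\begin{align*}
\frac{\lambda_1 - \lambda_2}{n} &= \frac{nq}{n} = q, \\
K\lambda_1 + (n-K)\lambda_2 - n
  &= K\,nq + n\bigl[K(p-q) + (1-p)\bigr] - n \\
  &= nKp - np = n(K-1)\,p, \\
\frac{K\lambda_1 + (n-K)\lambda_2 - n}{n(K-1)} &= p .
\end{align*}
```

The second line uses $\lambda_1 = \lambda_2 + nq$, so that $K\lambda_1 + (n-K)\lambda_2 = n\lambda_2 + Knq$.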

b) Choosing $t$: Using the above bounds on $\hat{p}$ and $\hat{q}$, we obtain

\[
t = \frac{p+q}{2} + \frac{\hat{p} - p + \hat{q} - q}{2}
\le \frac{p+q}{2} + c_4 \frac{\sqrt{p(1-q)n}}{K}
\le \frac{p+q}{2} + \frac{p-q}{4} = \frac{3}{4}p + \frac{1}{4}q,
\]

where in the last inequality we use the lower bound on $K(p-q)$ assumed in the theorem. This proves one side
of the inequality for $t$, and the other side is proved in a similar way.
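The plug-in step of this section can be illustrated numerically. The sketch below assumes the estimator formulas suggested by the error expressions of this proof, with $t = (\hat{p} + \hat{q})/2$ (which is what the first equality in b) amounts to). The function name `estimate_p_q_t` and the perturbation size are hypothetical choices for illustration.

```python
def estimate_p_q_t(lam1: float, lam2: float, n: int, K: int):
    """Plug-in estimates from the two largest eigenvalues (assumed formulas):
    q_hat = (lam1 - lam2) / n,
    p_hat = (K*lam1 + (n-K)*lam2 - n) / (n*(K-1)),
    t     = (p_hat + q_hat) / 2."""
    if K <= 1 or n <= 0:
        raise ValueError("need K > 1 and n > 0")
    q_hat = (lam1 - lam2) / n
    p_hat = (K * lam1 + (n - K) * lam2 - n) / (n * (K - 1))
    t = (p_hat + q_hat) / 2
    return p_hat, q_hat, t


if __name__ == "__main__":
    n, K, p, q = 12, 4, 0.7, 0.2
    lam1 = K * (p - q) + n * q + (1 - p)   # population lambda_1
    lam2 = K * (p - q) + (1 - p)           # population lambda_2
    delta = 0.05                           # stand-in for the deviation in (13)
    p_hat, q_hat, t = estimate_p_q_t(lam1 + delta, lam2 - delta, n, K)
    print(p_hat, q_hat, t)
    # the two-sided condition on t established in b)
    print(0.25 * p + 0.75 * q <= t <= 0.75 * p + 0.25 * q)
```

With eigenvalue errors of size at most `delta`, the formulas give $|\hat{q} - q| \le 2\,\texttt{delta}/n$ and $|\hat{p} - p| \le \texttt{delta}/(K-1)$, which is the mechanism behind the bounds above.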


X.  Conclusion 

This work is motivated by clustering large-scale networks such as modern online social networks, where
the graphs are often highly noisy and have heterogeneous and non-random structures. We considered a
natural and versatile model, namely the semi-random Generalized Stochastic Blockmodel, for clustered
random graphs. This model recovers many classical generative models for graph clustering. We presented
a convex optimization formulation, essentially a convexification of the maximum likelihood estimator. Our
theoretical analysis shows that this method is guaranteed to recover the correct clusters under a wide range
of problem parameters. In fact, our method outperforms every existing method in this setting, i.e., it succeeds
under less restrictive conditions. Simulation studies also validate the effectiveness of the
proposed method. Immediate goals for future work include faster algorithm implementations, as well as
developing effective post-processing schemes (e.g., rounding) when the obtained solution is not an exact
cluster matrix.
Appendix

In  this  section,  we  record  two  technical  lemmas  that  are  needed  in  the  proofs  of  our  theoretical  results. 
The  first  lemma  is  a  standard  bound  on  the  spectral  norm  of  a  random  symmetric  matrix. 

Lemma 4. Suppose $Y$ is a symmetric $n \times n$ matrix, where $Y_{ij}$, $1 \le i, j \le n$ are independent random
variables, each of which has mean 0 and variance at most $\sigma^2$ and is bounded in absolute value by $B$. If
$\sigma \ge c_1 B$ ^|- " for some absolute constant $c_1 > 0$, then with probability at least $1 - n^{-10}$,

\[
\|Y\| \le 4\sigma\sqrt{n}.
\]

Proof. Except for $Y$ being symmetric, the proof is the same as that of Theorem 3.1 in [41]. □
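The scale $4\sigma\sqrt{n}$ in Lemma 4 can be compared with a numerical estimate of the spectral norm. The sketch below is an illustration under assumptions that are not in the paper: the entries are symmetric random signs (so $\sigma = B = 1$), the norm is estimated by plain power iteration, and whether a given $n$ satisfies the lemma's condition on $\sigma$ depends on the unspecified constant $c_1$. Both function names are hypothetical.

```python
import random
from math import sqrt


def random_sign_matrix(n: int, seed: int = 0) -> list:
    """Symmetric n x n matrix with independent +/-1 entries on and above
    the diagonal (mean 0, variance 1, bounded by 1)."""
    rng = random.Random(seed)
    Y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            Y[i][j] = Y[j][i] = rng.choice((-1.0, 1.0))
    return Y


def spectral_norm(Y: list, iters: int = 300) -> float:
    """Power-iteration estimate of ||Y|| for a symmetric matrix.
    Each iterate ||Y x|| with ||x|| = 1 is a lower bound on ||Y||."""
    n = len(Y)
    x = [1.0] + [0.0] * (n - 1)          # start at the first basis vector
    est = 0.0
    for _ in range(iters):
        y = [sum(Y[i][j] * x[j] for j in range(n)) for i in range(n)]
        est = sqrt(sum(v * v for v in y))
        if est == 0.0:
            return 0.0
        x = [v / est for v in y]
    return est


if __name__ == "__main__":
    n = 40
    Y = random_sign_matrix(n)
    print(spectral_norm(Y), 4 * 1.0 * sqrt(n))   # estimate vs. 4*sigma*sqrt(n)
```

Because power iteration only approaches $\|Y\|$ from below, the printed estimate is indicative rather than a certified value.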

The  second  lemma  is  the  standard  Bernstein  inequality  for  the  sum  of  independent  random  variables. 

Lemma 5. ([42], Proposition 5.16) Let $Y_1, \ldots, Y_N$ be independent random variables, each of which is
bounded in absolute value by $B$ a.s. and has variance bounded by $\sigma^2$. There exist universal positive
constants $c_0, c_1, c_2$ independent of $\sigma$, $B$, $N$ and $n$ such that if $\sigma \ge B$ pA, then we have
with  probability  at  least  1  —  cyrt